ETX2250/ETF5922

Data Visualisations in R

Lecturer: Kate Saunders

Department of Econometrics and Business Statistics



Today’s Lecture

Learning Objectives

  • Introduce you to the grammar of graphics

  • Learn how to create plots in R

  • Learn about plots to show distributions, correlations and relationships (Covered ranking plots last lecture)

Plotting in R

Base R Plotting

Base R Plotting

There are basic plotting functions in R that don’t require any packages.

  • Examples of common base R plotting functions:

    • plot() for scatterplots.
    • barplot() for bar charts.
    • hist() for histograms.
    • check these out using the help menu ?plot()
  • But, there is limited flexibility for complex or layered plots

  • Base R plots are much harder to customise

Enter ggplot2

ggplot2

ggplot2 is part of the tidyverse, a collection of R packages great for data analytics and data science.

  • Offers a flexible approach to creating visually appealing graphics

  • A powerful and flexible tool for creating layered, customisable plots.

  • The “2” in ggplot2?

    • It’s the second iteration of the ggplot package, created by Hadley Wickham.

    • ggplot2 improved upon the original package with more features and better usability.

The grammar of graphics

Grammar of graphics

Part of what makes ggplot2 so powerful is it built on the ideas of Grammar of Graphics a text by Leland Wilkinson.

  • The grammar of graphics breaks down visualisations into individual pieces that you layer together

  • Basically creating a set of rules for creating almost any graphic

  • At first using ggplot2 will seem complicated

  • Once mastered you can use it to easily create detailed plots

Getting started in ggplot2

Shrek

Shrek and ggplot2

Don’t be scared of ggplot2, it’s just like Shrek!

Get to know it before you judge it! ❤️

“Ogres have layers. Onions have layers. You get it? We both have layers” - Shrek

The Layers

Key layers include:

  • Data:
    • The dataset you’re visualising.
  • Aesthetic Mappings (aes() for short):
    • Map variables to visual properties like x, y, color, size, etc.
  • Geometries (geom_*):
    • Define the type of plot (e.g., bars, lines, points).
  • Scales:
    • Control how data maps to aesthetics (e.g., axis limits, color gradients).
  • Facets:
    • Split the data into multiple panels (e.g., facet_wrap()).
  • Themes:
    • Customise the non-data components (e.g., background, grid lines).

Base Layer

Start by creating an empty plot on which to add your layers. We’ll add layers to this plot using +.

library(ggplot2)

ggplot()

Data Layer

Note

  • First step is to add our data

  • I’m going to use this data set I prepared earlier

boston_celtics = read_csv(here::here("data/boston_celtics.csv")) |> 
   filter(season == 2025)
head(boston_celtics)
# A tibble: 6 × 57
    game_id season season_type game_date  game_date_time      team_id team_uid  
      <dbl>  <dbl>       <dbl> <date>     <dttm>                <dbl> <chr>     
1 401703390   2025           2 2024-11-19 2024-11-20 00:00:00       2 s:40~l:46…
2 401704796   2025           2 2024-11-16 2024-11-17 01:00:00       2 s:40~l:46…
3 401704784   2025           2 2024-11-13 2024-11-14 00:30:00       2 s:40~l:46…
4 401703370   2025           2 2024-11-12 2024-11-13 00:00:00       2 s:40~l:46…
5 401704768   2025           2 2024-11-10 2024-11-10 20:30:00       2 s:40~l:46…
6 401704753   2025           2 2024-11-08 2024-11-09 00:30:00       2 s:40~l:46…
# ℹ 50 more variables: team_slug <chr>, team_location <chr>, team_name <chr>,
#   team_abbreviation <chr>, team_display_name <chr>,
#   team_short_display_name <chr>, team_color <chr>,
#   team_alternate_color <chr>, team_logo <chr>, team_home_away <chr>,
#   team_score <dbl>, team_winner <lgl>, assists <dbl>, blocks <dbl>,
#   defensive_rebounds <dbl>, fast_break_points <dbl>, field_goal_pct <dbl>,
#   field_goals_made <dbl>, field_goals_attempted <dbl>, …

Reading Data

Checklist

  • Remember you need to tell R where to look for this file by setting your working directory

  • Setting up a R Project helps with this!

  • Check you current working directory with getwd()

  • Check your file is in this directory list.files()

  • After your read in your data look at it using View() of head()

  • Make sure it looks like what you expect

  • Also check the structure of your data str()

  • If all that seems good - we can add it to our plot!

Add you data layer

It’s still an empty plot because we haven’t told R what to do with the data yet.

ggplot(data = boston_celtics) 

Geometry Layer (geom)

geom

  • Let’s create a coloumn plot

  • Use the geometry layer - geom_col

  • Similar to geom_bar (but does slightly different things)

  • If you type ?geom_ in your Console and hit tab to scroll through a list of all the different plot geometries

  • Think of all these types is like the Visualisation Pane in Power BI

Bar Plot

Add your geom

This is what your code should look like when you add your geom layer

ggplot(data = boston_celtics) + 
  geom_col()

Warning

  • But … this code won’t work yet, because we haven’t added our aesthetic layer

  • The aesthetic layer defines how data is mapped to visual properties in your plot

    • e.g what goes on the x/y axes

Aesthetic Layer

Common Aesthetic Mappings

Use the aes() function to map variables to aesthetics.

The common parts are:

  • x: The variable on the x-axis.

  • y: The variable on the y-axis.

  • color: The color of points, lines, or outlines.

  • fill: The fill color for bars, areas, or shapes.

  • size: The size of points or lines.

  • shape: The shape of points (e.g., circles, triangles).

  • alpha: The transparency level.

Adding the aesthetic layer

ggplot(data = boston_celtics, aes(x = game_date, y = team_score)) + 
  geom_col() 

Another Option

Important

If you are going to use multiple data types or need multiple aesthetics layers it is better to put the code about the data and the aesthetics in the geom layer directly.

ggplot() + 
  geom_col(data = boston_celtics, aes(x = game_date, y = team_score)) 

Colour and Fill

Set the bar colour to Green

ggplot() + 
  geom_col(data = boston_celtics, aes(x = game_date, y = team_score), col = "forestgreen") 

Colour and Fill

Set the bar fill to Green

ggplot() + 
  geom_col(data = boston_celtics, aes(x = game_date, y = team_score), fill = "forestgreen") 

Scale Layer

Next layer is the visual elements is scale. e.g. axis limits and color scales

ggplot(data = boston_celtics) + 
  geom_col(aes(x = game_date, y = team_score, fill = team_winner))

Colour and Fill In Code

Common misunderstandings

  • If it the asethetic mapping is the name of a variable then you need to put it in the the aes() brackets

  • If it is fixed, e.g. you want to colour everything black, then it is just in the geom_*() bracket.

  • Depending on what geom you use, there may be a difference between colour and fill

  • Both spellings of colour and color will work

Using In Built Fill/Colour Scales

You can use the inbuilt palettes from RColourBrewer

ggplot(data = boston_celtics) + 
  geom_col(aes(x = game_date, y = team_score, fill = team_winner), col = "gray") + 
  scale_fill_brewer(palette = "Greens") 

Using Manual Fill/Colour Scales

You can also change fill/colour scales manually using scale_colour_manual or scale_fill_manual.

ggplot(data = boston_celtics) + 
  geom_col(aes(x = game_date, y = team_score, fill = team_winner), col = "gray") +
  scale_fill_manual(label = c("Loss", "Win"), values = c("TRUE" = "forestgreen", "FALSE" = "lightgreen"))

Colour Scales

Colour Scales

IMO: Colour scales are on the hardest parts about learning ggplot2

  • To change colour scale, use scale_colour_*

  • To change fill scale, use scale_fill_*

  • Check out all the different types of scales using the help menu ?scale_colour and hit tab.

  • Note for discrete variables needing distinct colours, such as categorical variables, you can use scale_*_brewer

  • For variables needing a smooth gradient use scale_*_distiller

  • You can also set colours manually using scale_*_manual

Note * here is like a blank space and it means there are multiple things that could be inserted here

Themes

Themes

  • Here is a list of the themes.

  • My favourite is theme_bw().

Default Theme

Grey Background

  • The default background for ggplot2 is arguably chartjunk.

  • But - There are good reasons for using it.

“We can still see the gridlines to aid in the judgement of position (Cleveland, 1993b), but they have little visual impact and we can easily”tune” them out… Finally, the grey background creates a continuous field of colour which ensures that the plot is perceived as a single visual entity.” -Wickham on the grey background
Source: ggplot2: Elegant Graphics for Data Analysis.

Changing Theme Background

Here I change the theme background to theme_bw().

ggplot(data = boston_celtics) + 
  geom_col(aes(x = game_date, y = team_score, fill = team_winner), col = "gray") +
  scale_fill_manual(label = c("Loss", "Win"), values = c("TRUE" = "forestgreen", "FALSE" = "lightgreen")) + 
  theme_bw() 

Plot Theme Specifics

Plot Theme Specifics

  • To tune the more specific aspects of your theme, we use the theme() layer.

  • Look up ?theme there are a lot of options!

Changing Theme Specifics

Here I move the legend to the bottom and remove the legend label.

ggplot(data = boston_celtics) + 
  geom_col(aes(x = game_date, y = team_score, fill = team_winner), col = "gray") +
  scale_fill_manual(label = c("Loss", "Win"), values = c("TRUE" = "forestgreen", "FALSE" = "lightgreen")) + 
  theme_bw() + 
  theme(legend.position = "bottom", legend.title = element_blank())

Polising your plot

The theme() layer is also were you can specifics about titles, text and axes. You could also change label names in the theme, but xlab, ylab and ggtitle are easier to use.

celtics_plot <- ggplot(data = boston_celtics) + 
  geom_col(aes(x = game_date, y = team_score, fill = team_winner), col = "gray") +
  scale_fill_manual(label = c("Win", "Loss"), values = c("TRUE" = "forestgreen", "FALSE" = "lightgreen")) + 
  theme_bw() + 
  theme(
    legend.position = "bottom", 
    legend.title = element_blank(),
    plot.title = element_text(size = 20), 
    axis.title.x = element_text(size = 15),  
    axis.title.y = element_text(size = 15),  
    axis.text = element_text(size = 12),      
    legend.text = element_text(size = 12)) + 
  xlab("Game Date") + 
  ylab("Team Score") + 
  ggtitle("Boston Celtics 2025 Season so far")

Final plot

celtics_plot

Your turn

Your turn

Recreate this plot - But experiment with the different parts to see how they work!

  • Change the fixed fill and the colour

  • Change the colour/fill palette

  • Experiment with different ranges of the y-axes (x is tricky so don’t worry about that for now.)

  • Change the titles and axes labels

  • Change the theme

Do these one at a time so you understand what each piece of code does!

Distribution Plots

Distribution Plots

Distribution Plots

Instead of comparing the points scored per game, I might like to look at the distribution of points scored in a game.

  • What’s the highest number of points scored?
  • What’s the lowest number of points scored?
  • What’s the average number of points scored?
  • How much variation is their in the number of points scored each game?

Histogram

Too many bins:

Histogram

Too few bins:

Histogram

A better number of bins:

Density Plot

Box Plot

When to Use: Histogram

Histogram

  • Use to visualise the distribution of a single numeric variable.
  • Good for identifying the shape of data (e.g., normal, skewed, bimodal).
  • Example: Examining the frequency of income ranges.

When to Use: Density Plot

Density Plot

  • Use as a smoothed alternative to a histogram.
  • Best for highlighting the underlying distribution without binning.
  • Example: Distribution of test scores.

When to Use: Boxplot

Boxplot

  • Use to summarize the central tendency and spread of a numeric variable.
  • Provides a 5 number summary
    • min, first quartile (Q1), median, third quartile (Q3), max
  • Can be used to show outliers compared to the main data distribution
  • Example: Summarising salary bands

A guide

Choosing the Right Plot

Plot Type Best For Features
Histogram Raw frequency counts Splits the data into bins
Density Plot Smoothed Distribution Continuous, no bins
Box plot Showing the 5 number Summary Shows outliers as points

Common Mistakes to Avoid

Histograms

Choosing Poor Bin Sizes
- Too few bins: Hides important details.
- Too many bins: Creates noise and over complicates the plot.

Common Mistakes to Avoid

Density Plots

Over-Smoothing or Under-Smoothing
- Over-smoothing hides important features like multiple peaks.
- Under-smoothing makes the plot noisy.

Inappropriate Use for Small Datasets
- Density plots require sufficient data points for meaningful results.

Common Mistakes to Avoid

Boxplots

Careful about Outliers
- Outliers are not necessarily errors; they may reflect valid data points.
- Best to leave them in, unless you have a good reason otherwise
- R automatically plots outliers as points, if points are 1.5 times greater than the interquartile range.

Ignoring the Context of the Data
- A boxplot only shows a summary - If the shape of the underlying distribution is important, best to use something else.

Key Takeaways

Important

  • Pick the appropriate plot for you data

  • In the previous example, as we only have a few games for the 2025 season a box plot might be best as it gives a high level summary and there isn’t enough data to warrant a more detailed plot.

  • If we had all the historical game data, then a histogram or a density plot would be the better choice to visualise the data.

Histogram and Density

Code

boston_celtics = read_csv("data/boston_celtics.csv")

ggplot(data = boston_celtics, aes(x = team_score)) + 
  geom_histogram(aes(y = ..density..), 
                 alpha = 0.4, 
                 col = "lightgray") + 
  geom_density(col = "red", 
               adjust = 1.25, 
               size = 1.1) +
  geom_vline(aes(xintercept = mean(team_score)), 
             col = "red", 
             linetype = "dotted", 
             size = 1.1) +
  theme_bw() + 
  ggtitle("Boston Celtics Team Score 2021 - Present")

Plots showing relationships

Correlation and Relationships

Plots showing correlation and realtionships

  • Scatter Plot (geom_point)
  • Bubble Plots (size)
  • Line Plots (geom_line)
  • Heatmaps (geom_hex, geom_bin2d)
  • Contour Plots (geom_density_2d)

And their variations!

Scatter Plot

Good for looking a relationships between 2 numeric variables

field_goals_scatter_plot <- ggplot(data = boston_celtics) + 
  geom_point(aes(x = field_goals_attempted, y = field_goals_made)) + 
  theme_bw() + 
  xlab("Field Goals Attempted (2 pts)") + 
  ylab("Field Goals Made (2 pts)") + 
  theme(legend.position = "bottom") + 
  ggtitle("Boston Celtics Field Goals 2021 - Present")

Scatter Plot

field_goals_scatter_plot

Bubble Plot

Can be used to look at relationships between 3 numeric variables. Good when you want to show differences in “size”.

field_goals_bubble_plot <-
  boston_celtics |>
  filter(season == 2025) |> 
  ggplot() + 
  geom_point(aes(x = field_goals_attempted, y = field_goals_made, size = team_score), alpha = 0.4) + 
 scale_size_continuous(
    name = "Team Score",
    breaks = c(110, 115, 120, 125, 130)
  ) +
  theme_bw() + 
  xlab("Field Goals Attempted (2 pts)") + 
  ylab("Field Goals Made (2 pts)") + 
  theme(legend.position = "bottom") + 
  ggtitle("Boston Celtics Field Goals 2025 Season") + 
  coord_fixed()

Bubble Plot

field_goals_bubble_plot

Bubble Plot ctd

One thing to watch out for in both scatter and bubble plots is overplotting. Changing transparency helped in the last plot, but sometimes there are too many points!

field_goals_bubble_plot_all <-
  ggplot(data = boston_celtics) + 
  geom_point(aes(x = field_goals_attempted, y = field_goals_made, size = team_score), alpha = 0.4) + 
 scale_size_continuous(
    name = "Team Score",
    breaks = c(110, 115, 120, 125, 130)
  ) +
  theme_bw() + 
  xlab("Field Goals Attempted (2 pts)") + 
  ylab("Field Goals Made (2 pts)") + 
  theme(legend.position = "bottom") + 
  ggtitle("Boston Celtics Field Goals 2021 - Present") + 
  coord_fixed()

Bubble Plot ctd

This isn’t very helpful!

field_goals_bubble_plot_all

Heatmap and Density

field_goals_heatmap_all <-
  ggplot(data = boston_celtics) + 
  geom_hex(
    aes(
    x = field_goals_attempted, 
    y = field_goals_made
    )) + 
  scale_fill_gradient(name = "Count") +
  geom_density_2d(
    aes(
    x = field_goals_attempted, 
    y = field_goals_made
    ), alpha = 0.4, adjust = 1.2) +
  theme_bw() +
  xlab("Field Goals Attempted (2 pts)") + 
  ylab("Field Goals Made (2 pts)") + 
  ggtitle("Boston Celtics Field Goals 2021 - 2024") + 
  coord_fixed()

Heatmap and Density

field_goals_heatmap_all

Summary

Summary

Summary

  • Learnt about plotting in R.

    • This included how to use the ggplot2 package in R, and

    • The grammar of graphics

  • Learnt ggplot2 is like Shrek!

    • It is an onion with many layers.
  • Learnt how to create a range of plots in R

    • These include common plots to show correlations and relationships
  • Remember the tips for learning R from Lecture 1 if you get stuck!

Material developed by Dr. Kate Saunders