Lecturer: Kate Saunders
Department of Econometrics and Business Statistics
Learning Objectives
Introduce you to the grammar of graphics
Learn how to create plots in R
Learn about plots to show distributions, correlations and relationships (Covered ranking plots last lecture)
Base R Plotting
There are basic plotting functions in R that don’t require any packages.
Examples of common base R plotting functions:
plot() for scatterplots.
barplot() for bar charts.
hist() for histograms.
See ?plot() for help.
But, there is limited flexibility for complex or layered plots
Base R plots are much harder to customise
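As a rough sketch of the three base R functions listed above, here they are applied to the built-in mtcars data set. The data set and axis labels are my choice for illustration only; they are not part of the lecture data.

```r
# Base R plotting with the built-in mtcars data set (no packages needed)

# Scatterplot of weight against fuel efficiency
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Bar chart of the number of cars for each cylinder count
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts, xlab = "Cylinders", ylab = "Number of cars")

# Histogram of fuel efficiency; hist() also returns the bin counts
mpg_hist <- hist(mtcars$mpg, xlab = "Miles per gallon",
                 main = "Distribution of mpg")
```

Each call draws a complete plot in one go, which is quick, but it is also why layering several elements together gets awkward in base R.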
ggplot2
ggplot2 is part of the tidyverse, a collection of R packages great for data analytics and data science.
Offers a flexible approach to creating visually appealing graphics
A powerful and flexible tool for creating layered, customisable plots.
The “2” in ggplot2?
It’s the second iteration of the ggplot package, created by Hadley Wickham.
ggplot2 improved upon the original package with more features and better usability.
Grammar of graphics
Part of what makes ggplot2 so powerful is that it is built on the ideas of The Grammar of Graphics, a text by Leland Wilkinson.
The grammar of graphics breaks visualisations down into individual pieces that you layer together
This gives you a set of rules for creating almost any graphic
At first, using ggplot2 will seem complicated
Once mastered, you can use it to easily create detailed plots
Shrek and ggplot2
Don’t be scared of ggplot2, it’s just like Shrek!
Get to know it before you judge it! ❤️
“Ogres have layers. Onions have layers. You get it? We both have layers” - Shrek
Key layers include:
Aesthetics (aes() for short)
Geometries (geom_*)
Facets (facet_wrap())
Start by creating an empty plot on which to add your layers. We’ll add layers to this plot using +.
Note
First step is to add our data
I’m going to use this data set I prepared earlier
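library(tidyverse)  # provides read_csv(), filter() and ggplot()
# Read the game data and keep only the 2025 season
boston_celtics <- read_csv(here::here("data/boston_celtics.csv")) |>
  filter(season == 2025)
head(boston_celtics)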
# A tibble: 6 × 57
game_id season season_type game_date game_date_time team_id team_uid
<dbl> <dbl> <dbl> <date> <dttm> <dbl> <chr>
1 401703390 2025 2 2024-11-19 2024-11-20 00:00:00 2 s:40~l:46…
2 401704796 2025 2 2024-11-16 2024-11-17 01:00:00 2 s:40~l:46…
3 401704784 2025 2 2024-11-13 2024-11-14 00:30:00 2 s:40~l:46…
4 401703370 2025 2 2024-11-12 2024-11-13 00:00:00 2 s:40~l:46…
5 401704768 2025 2 2024-11-10 2024-11-10 20:30:00 2 s:40~l:46…
6 401704753 2025 2 2024-11-08 2024-11-09 00:30:00 2 s:40~l:46…
# ℹ 50 more variables: team_slug <chr>, team_location <chr>, team_name <chr>,
# team_abbreviation <chr>, team_display_name <chr>,
# team_short_display_name <chr>, team_color <chr>,
# team_alternate_color <chr>, team_logo <chr>, team_home_away <chr>,
# team_score <dbl>, team_winner <lgl>, assists <dbl>, blocks <dbl>,
# defensive_rebounds <dbl>, fast_break_points <dbl>, field_goal_pct <dbl>,
# field_goals_made <dbl>, field_goals_attempted <dbl>, …
Checklist
Remember you need to tell R where to look for this file by setting your working directory
Setting up an R Project helps with this!
Check your current working directory with getwd()
Check your file is in this directory with list.files()
After you read in your data, look at it using View() or head()
Make sure it looks like what you expect
Also check the structure of your data with str()
If all that seems good - we can add it to our plot!
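The checklist above might look something like this in practice. This is only a sketch: the file name example_scores.csv and its contents are made up, and it is written to a temporary folder so the code runs anywhere. I also use base R's read.csv() here so that no packages are needed.

```r
getwd()   # which folder is R currently looking in?

# Hypothetical file, written to a temporary folder just so this sketch runs
csv_path <- file.path(tempdir(), "example_scores.csv")
write.csv(data.frame(game = 1:3, team_score = c(108, 113, 139)),
          csv_path, row.names = FALSE)

# Is the file where we think it is?
list.files(tempdir(), pattern = "example_scores")

scores <- read.csv(csv_path)  # read the data in
head(scores)                  # look at the first few rows
str(scores)                   # check the column names and types
```

With your own data you would skip the write.csv() step and point list.files() at your project folder instead.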
It’s still an empty plot because we haven’t told R what to do with the data yet.
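A minimal sketch of what this stage looks like in code, assuming a small made-up data frame called games (not the Celtics data). Both objects below print as a blank panel, because no geom layer has been added yet.

```r
library(ggplot2)

# Made-up data for illustration only
games <- data.frame(
  game       = c("Game 1", "Game 2", "Game 3"),
  team_score = c(108, 113, 139)
)

empty_plot <- ggplot()              # a blank canvas: no data, no layers
data_plot  <- ggplot(data = games)  # data attached, but nothing to draw yet

data_plot  # prints an empty grey panel
```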
geom
Let’s create a column plot
Use the geometry layer - geom_col
Similar to geom_bar, but geom_col uses the values in your data as the bar heights, whereas geom_bar counts the rows in each group
Type ?geom_ in your Console and hit Tab to scroll through a list of all the different plot geometries
Think of all these types as being like the Visualisation Pane in Power BI
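To make the difference between geom_bar and geom_col concrete, here is a small sketch on made-up data (the opponent and team_score values are invented). It uses aes(), which is covered in the aesthetics part of this lecture.

```r
library(ggplot2)

# Made-up data for illustration: one row per game
games <- data.frame(
  opponent   = c("A", "A", "B"),
  team_score = c(100, 110, 120)
)

# geom_bar() counts the rows in each group (A has 2 games, B has 1)
bar_plot <- ggplot(games) +
  geom_bar(aes(x = opponent))

# geom_col() uses the y values as bar heights (stacked within a group)
col_plot <- ggplot(games) +
  geom_col(aes(x = opponent, y = team_score))

layer_data(bar_plot)$count  # the counts that geom_bar() computed
```

So use geom_col when your data already contains the heights, and geom_bar when you want R to count for you.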
Add your geom
This is what your code should look like when you add your geom layer
Warning
But … this code won’t work yet, because we haven’t added our aesthetic layer
The aesthetic layer defines how data is mapped to visual properties in your plot
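Here is a sketch of why the aesthetic layer is needed, using a small made-up data frame rather than the Celtics data. Creating the plot object works, but ggplot2 raises an error when it tries to build the plot, because geom_col() has not been told which variables go on x and y.

```r
library(ggplot2)

games <- data.frame(   # made-up data for illustration
  game  = c("Game 1", "Game 2", "Game 3"),
  score = c(108, 113, 139)
)

# Data and geom, but no aesthetic mapping yet
no_aes_plot <- ggplot(data = games) + geom_col()

# Building the plot fails, so we catch the error message instead of stopping
build_result <- tryCatch(
  ggplot_build(no_aes_plot),
  error = function(e) conditionMessage(e)
)
is.character(build_result)  # checks that an error message was caught

# Adding the aesthetic layer fixes it
working_plot <- ggplot(data = games) +
  geom_col(aes(x = game, y = score))
working_plot
```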
Common Aesthetic Mappings
Use the aes() function to map variables to aesthetics.
The common parts are:
x: The variable on the x-axis.
y: The variable on the y-axis.
color: The color of points, lines, or outlines.
fill: The fill color for bars, areas, or shapes.
size: The size of points or lines.
shape: The shape of points (e.g., circles, triangles).
alpha: The transparency level.
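A short sketch of several of these mappings in one plot, using the built-in mtcars data set (my choice for illustration; it is not the lecture data).

```r
library(ggplot2)

aes_plot <- ggplot(data = mtcars) +
  geom_point(
    aes(
      x      = wt,           # x-axis variable
      y      = mpg,          # y-axis variable
      colour = factor(cyl),  # point colour by number of cylinders
      size   = hp            # point size by horsepower
    ),
    alpha = 0.6              # fixed transparency, so it sits outside aes()
  )
aes_plot
```

Everything inside aes() changes with the data, while alpha = 0.6 is the same for every point.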
Important
If you are going to use multiple data sets, or need different aesthetics for different layers, it is better to put the data and aesthetics code directly in the geom layer.
Set the bar colour to Green
Set the bar fill to Green
The next layer of visual elements is the scale, e.g. axis limits and colour scales
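As one possible example of a scale layer, this sketch fixes the range of the y-axis with scale_y_continuous(). The data frame and the limits of 0 to 150 are made up for illustration.

```r
library(ggplot2)

games <- data.frame(   # made-up data for illustration
  game  = c("Game 1", "Game 2", "Game 3"),
  score = c(108, 113, 139)
)

limits_plot <- ggplot(games) +
  geom_col(aes(x = game, y = score)) +
  # Scale layer: set the y-axis range (bars start at 0, so keep 0 in range)
  scale_y_continuous(limits = c(0, 150))
limits_plot
```

Be careful with limits on bar and column plots: a bar that falls partly outside the limits is dropped rather than cropped.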
Common misunderstandings
If the aesthetic mapping is the name of a variable, then you need to put it inside the aes() brackets
If it is fixed, e.g. you want to colour everything black, then it just goes inside the geom_*() brackets.
Depending on what geom you use, there may be a difference between colour and fill
Both spellings, colour and color, will work
You can use the inbuilt palettes from RColorBrewer
You can also change fill/colour scales manually using scale_colour_manual or scale_fill_manual.
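The difference between a fixed and a mapped aesthetic can be sketched like this. The games data frame is made up for illustration, with the win column standing in for a variable like team_winner.

```r
library(ggplot2)

games <- data.frame(   # made-up data for illustration
  game  = c("Game 1", "Game 2", "Game 3"),
  score = c(108, 113, 139),
  win   = c(FALSE, TRUE, TRUE)
)

# Fixed: every bar gets the same fill and outline, so these go outside aes()
fixed_plot <- ggplot(games) +
  geom_col(aes(x = game, y = score), fill = "green", colour = "black")

# Mapped: the fill depends on the variable win, so it goes inside aes()
mapped_plot <- ggplot(games) +
  geom_col(aes(x = game, y = score, fill = win)) +
  scale_fill_manual(values = c("TRUE" = "forestgreen", "FALSE" = "lightgreen"))
```

For bars, colour sets the outline and fill sets the inside, which is why the two can differ depending on the geom.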
Colour Scales
IMO: Colour scales are one of the hardest parts of learning ggplot2
To change colour scale, use scale_colour_*
To change fill scale, use scale_fill_*
Check out all the different types of scales using the help menu: type ?scale_colour and hit Tab.
Note for discrete variables needing distinct colours, such as categorical variables, you can use scale_*_brewer
For variables needing a smooth gradient use scale_*_distiller
You can also set colours manually using scale_*_manual
Note * here is like a blank space and it means there are multiple things that could be inserted here
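Here is a sketch of the three kinds of colour scale mentioned above, with colour filled in for the *. It uses the built-in mtcars data set, and the palette names and manual colours are just my picks for illustration.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = wt, y = mpg))  # built-in data

# Discrete variable: distinct colours from a ColorBrewer palette
brewer_plot <- base_plot +
  geom_point(aes(colour = factor(cyl))) +
  scale_colour_brewer(palette = "Set2", name = "Cylinders")

# Continuous variable: smooth gradient from a ColorBrewer palette
distiller_plot <- base_plot +
  geom_point(aes(colour = hp)) +
  scale_colour_distiller(palette = "Blues", name = "Horsepower")

# Manual colours, matched to each level by name
manual_plot <- base_plot +
  geom_point(aes(colour = factor(cyl))) +
  scale_colour_manual(
    values = c("4" = "steelblue", "6" = "orange", "8" = "firebrick"),
    name   = "Cylinders"
  )
```

Swap colour for fill in both the aes() and the scale name when you are working with bars or areas.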
Themes
Here is a list of the themes.
My favourite is theme_bw().
Grey Background
The default background for ggplot2 is arguably chartjunk.
But there are good reasons for using it.
“We can still see the gridlines to aid in the judgement of position (Cleveland, 1993b), but they have little visual impact and we can easily “tune” them out… Finally, the grey background creates a continuous field of colour which ensures that the plot is perceived as a single visual entity.” - Wickham on the grey background
Source: ggplot2: Elegant Graphics for Data Analysis.
Here I change the theme background to theme_bw().
Plot Theme Specifics
To tune the more specific aspects of your theme, we use the theme() layer.
Look up ?theme - there are a lot of options!
Here I move the legend to the bottom and remove the legend label.
The theme() layer is also where you can set specifics about titles, text and axes. You could also change label names in the theme, but xlab, ylab and ggtitle are easier to use.
celtics_plot <- ggplot(data = boston_celtics) +
  geom_col(aes(x = game_date, y = team_score, fill = team_winner), col = "gray") +
  # Name the labels so each one is matched to the right value of team_winner
  scale_fill_manual(
    labels = c("TRUE" = "Win", "FALSE" = "Loss"),
    values = c("TRUE" = "forestgreen", "FALSE" = "lightgreen")
  ) +
  theme_bw() +
  theme(
    legend.position = "bottom",
    legend.title = element_blank(),
    plot.title = element_text(size = 20),
    axis.title.x = element_text(size = 15),
    axis.title.y = element_text(size = 15),
    axis.text = element_text(size = 12),
    legend.text = element_text(size = 12)
  ) +
  xlab("Game Date") +
  ylab("Team Score") +
  ggtitle("Boston Celtics 2025 Season so far")
Your turn
Recreate this plot - But experiment with the different parts to see how they work!
Change the fixed fill and the colour
Change the colour/fill palette
Experiment with different ranges of the y-axes (x is tricky so don’t worry about that for now.)
Change the titles and axes labels
Change the theme
Do these one at a time so you understand what each piece of code does!
Distribution Plots
Instead of comparing the points scored per game, I might like to look at the distribution of points scored in a game.
Too many bins:
Too few bins:
A better number of bins:
Histogram
Density Plot
Boxplot
Choosing the Right Plot
Plot Type | Best For | Features |
---|---|---|
Histogram | Raw frequency counts | Splits the data into bins |
Density Plot | Smoothed Distribution | Continuous, no bins |
Box plot | Showing the 5 number Summary | Shows outliers as points |
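The three plot types in the table can be sketched on the same data as follows. The scores are simulated for illustration (they are not real game scores), and the seed and number of bins are arbitrary choices.

```r
library(ggplot2)

set.seed(2250)  # so the made-up scores are the same every time
scores <- data.frame(team_score = round(rnorm(200, mean = 115, sd = 12)))

# Histogram: raw counts in bins
hist_plot <- ggplot(scores, aes(x = team_score)) +
  geom_histogram(bins = 15)

# Density plot: a smoothed version of the distribution
density_plot <- ggplot(scores, aes(x = team_score)) +
  geom_density()

# Boxplot: the five number summary, with outliers drawn as points
box_plot <- ggplot(scores, aes(x = team_score)) +
  geom_boxplot()
```

Only the geom changes between the three plots; the data and the aesthetic mapping stay the same.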
Histograms
Choosing Poor Bin Sizes
- Too few bins: Hides important details.
- Too many bins: Creates noise and over complicates the plot.
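To see the effect of the number of bins for yourself, you could try a sketch like this one. The scores are simulated for illustration, and the bin numbers 3, 100 and 15 are arbitrary choices.

```r
library(ggplot2)

set.seed(2250)  # made-up scores, for illustration only
scores <- data.frame(team_score = round(rnorm(200, mean = 115, sd = 12)))

base_plot <- ggplot(scores, aes(x = team_score))

too_few  <- base_plot + geom_histogram(bins = 3)    # hides the shape
too_many <- base_plot + geom_histogram(bins = 100)  # noisy and spiky
better   <- base_plot + geom_histogram(bins = 15)   # shape is easy to read
```

There is no single correct number of bins, so it is worth trying a few values for your own data.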
Density Plots
Over-Smoothing or Under-Smoothing
- Over-smoothing hides important features like multiple peaks.
- Under-smoothing makes the plot noisy.
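In geom_density the amount of smoothing is controlled by the adjust argument, which multiplies the default bandwidth. The sketch below uses simulated data with two peaks (made up for illustration) and arbitrary adjust values.

```r
library(ggplot2)

set.seed(2250)
# Made-up data with two clusters of scores, so the true shape has two peaks
scores <- data.frame(
  team_score = c(rnorm(150, mean = 100, sd = 5), rnorm(150, mean = 125, sd = 5))
)

base_plot <- ggplot(scores, aes(x = team_score))

under_smoothed <- base_plot + geom_density(adjust = 0.1)  # noisy and wiggly
default_smooth <- base_plot + geom_density(adjust = 1)    # default bandwidth
over_smoothed  <- base_plot + geom_density(adjust = 5)    # peaks blur together
```

Comparing the three plots side by side is a quick check on whether a feature is real or just an artefact of the smoothing.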
Inappropriate Use for Small Datasets
- Density plots require sufficient data points for meaningful results.
Boxplots
Careful about Outliers
- Outliers are not necessarily errors; they may reflect valid data points.
- Best to leave them in, unless you have a good reason otherwise
- R automatically plots a point as an outlier if it lies more than 1.5 times the interquartile range below the first quartile or above the third quartile.
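The outlier rule can be worked through by hand in base R. This is a sketch on made-up numbers; it uses quantile() for the quartiles, which as far as I know is also what geom_boxplot does by default, with 1.5 as the default multiplier.

```r
# Made-up values for illustration; 30 is unusually far from the rest
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q1  <- quantile(x, 0.25)  # first quartile
q3  <- quantile(x, 0.75)  # third quartile
iqr <- q3 - q1            # interquartile range

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points outside the fences are the ones drawn individually as outliers
outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

Note that the rule only flags points as unusual. Whether they are errors is a judgement you make using the context of the data.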
Ignoring the Context of the Data
- A boxplot only shows a summary - If the shape of the underlying distribution is important, best to use something else.
Important
Pick the appropriate plot for your data
In the previous example, as we only have a few games for the 2025 season a box plot might be best as it gives a high level summary and there isn’t enough data to warrant a more detailed plot.
If we had all the historical game data, then a histogram or a density plot would be the better choice to visualise the data.
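# Re-read the full data set (all seasons, not just 2025)
boston_celtics <- read_csv("data/boston_celtics.csv")
ggplot(data = boston_celtics, aes(x = team_score)) +
  # Put the histogram on the density scale so it matches the density curve
  geom_histogram(aes(y = after_stat(density)),
                 alpha = 0.4,
                 col = "lightgray") +
  # linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_density(col = "red",
               adjust = 1.25,
               linewidth = 1.1) +
  # Dotted vertical line at the mean team score
  geom_vline(aes(xintercept = mean(team_score)),
             col = "red",
             linetype = "dotted",
             linewidth = 1.1) +
  theme_bw() +
  ggtitle("Boston Celtics Team Score 2021 - Present")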
Plots showing correlation and relationships
Scatter plots (geom_point)
Bubble plots (size)
Line plots (geom_line)
Heatmaps (geom_hex, geom_bin2d)
2D density plots (geom_density_2d)
And their variations!
Scatter plots are good for looking at relationships between 2 numeric variables
field_goals_scatter_plot <- ggplot(data = boston_celtics) +
  geom_point(aes(x = field_goals_attempted, y = field_goals_made)) +
  theme_bw() +
  xlab("Field Goals Attempted (2 pts)") +
  ylab("Field Goals Made (2 pts)") +
  theme(legend.position = "bottom") +
  ggtitle("Boston Celtics Field Goals 2021 - Present")
Bubble plots can be used to look at relationships between 3 numeric variables. Good when you want to show differences in “size”.
field_goals_bubble_plot <-
  boston_celtics |>
  filter(season == 2025) |>
  ggplot() +
  geom_point(aes(x = field_goals_attempted, y = field_goals_made, size = team_score), alpha = 0.4) +
  scale_size_continuous(
    name = "Team Score",
    breaks = c(110, 115, 120, 125, 130)
  ) +
  theme_bw() +
  xlab("Field Goals Attempted (2 pts)") +
  ylab("Field Goals Made (2 pts)") +
  theme(legend.position = "bottom") +
  ggtitle("Boston Celtics Field Goals 2025 Season") +
  coord_fixed()
One thing to watch out for in both scatter and bubble plots is overplotting. Changing transparency helped in the last plot, but sometimes there are too many points!
field_goals_bubble_plot_all <-
  ggplot(data = boston_celtics) +
  geom_point(aes(x = field_goals_attempted, y = field_goals_made, size = team_score), alpha = 0.4) +
  scale_size_continuous(
    name = "Team Score",
    breaks = c(110, 115, 120, 125, 130)
  ) +
  theme_bw() +
  xlab("Field Goals Attempted (2 pts)") +
  ylab("Field Goals Made (2 pts)") +
  theme(legend.position = "bottom") +
  ggtitle("Boston Celtics Field Goals 2021 - Present") +
  coord_fixed()
This isn’t very helpful!
field_goals_heatmap_all <-
  ggplot(data = boston_celtics) +
  geom_hex(
    aes(
      x = field_goals_attempted,
      y = field_goals_made
    )
  ) +
  scale_fill_gradient(name = "Count") +
  geom_density_2d(
    aes(
      x = field_goals_attempted,
      y = field_goals_made
    ),
    alpha = 0.4, adjust = 1.2
  ) +
  theme_bw() +
  xlab("Field Goals Attempted (2 pts)") +
  ylab("Field Goals Made (2 pts)") +
  ggtitle("Boston Celtics Field Goals 2021 - 2024") +
  coord_fixed()
Summary
Learnt about plotting in R.
This included how to use the ggplot2 package in R, and the grammar of graphics
Learnt ggplot2 is like Shrek!
Learnt how to create a range of plots in R
Remember the tips for learning R from Lecture 1 if you get stuck!
Material developed by Dr. Kate Saunders
ETX2250/ETF5922