Lecturer: Kate Saunders
Department of Econometrics and Business Statistics
Learning Objectives
Introduce you to the grammar of graphics
Learn how to create plots in R
Learn about plots to show distributions, correlations and relationships
Builds on plots showing rankings/comparisons of categoric variables from last lecture
Base R Plotting
There are basic plotting functions in R that don’t require any packages.
Examples of common base R plotting functions:
plot() for scatterplots.
barplot() for bar charts.
hist() for histograms.
?plot()
But there is limited flexibility for complex or layered plots
Base R plots are much harder to customise
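As a rough sketch of the three base R functions listed above, the example below uses the built-in mtcars data set (an assumption made for illustration; it is not the lecture data). No packages are needed.

```r
# Base R plotting with the built-in mtcars data set (no packages needed)
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Scatterplot with plot()")

# Count the cars by number of cylinders, then draw one bar per group
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts, xlab = "Cylinders", ylab = "Number of cars",
        main = "Bar chart with barplot()")

# hist() also returns the bin counts, so we can store and inspect them
mpg_hist <- hist(mtcars$mpg, xlab = "Miles per gallon",
                 main = "Histogram with hist()")
sum(mpg_hist$counts)  # every car falls in exactly one bin
```

Each call draws a complete plot in one go, which is part of why layering and customising is harder here than in ggplot2.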
ggplot2
ggplot2 is part of the tidyverse, a collection of R packages great for data analytics and data science.
A powerful and flexible tool for creating layered, customisable plots.
The “2” in ggplot2?
It’s the second iteration of the ggplot package, created by Hadley Wickham.
ggplot2 improved upon the original package with more features and better usability.
Grammar of graphics
Part of what makes ggplot2 so powerful is that it is built on the ideas of The Grammar of Graphics, a text by Leland Wilkinson.
The grammar of graphics breaks down visualisations into individual pieces that you layer together
Basically creating a set of rules for creating almost any graphic
At first using ggplot2 will seem complicated
But master the grammar and you’ll create detailed plots easily
Shrek and ggplot2
Don’t be scared of ggplot2, it’s just like Shrek!
Get to know it before you judge it!
“Ogres have layers. Onions have layers. You get it? We both have layers” - Shrek
Key layers include:
Aesthetics (aes() for short)
Geometries (geom_*)
Facets (facet_wrap(), facet_grid())
Start by creating an empty plot on which to add your layers. We’ll add layers to this plot using +.
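A minimal sketch of starting from an empty plot and adding a layer with +. The single point at (1, 1) is purely illustrative and is not part of the lecture data.

```r
library(ggplot2)

# An empty plot: a blank canvas with no data and no layers yet
empty_plot <- ggplot()
length(empty_plot$layers)  # number of layers so far

# `+` returns a new plot with the extra layer added on top
one_layer_plot <- empty_plot +
  geom_point(aes(x = 1, y = 1))
length(one_layer_plot$layers)
```

Note that + does not change empty_plot itself; it builds a new plot object, so you can keep adding layers step by step.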
What to plot
First step is to add our data
I’m going to use this data set I prepared earlier
library(tidyverse)  # provides read_csv(), filter() and ggplot()
library(here)
all_years_boston_celtics = read_csv(here::here("data/boston_celtics.csv"))
boston_celtics = all_years_boston_celtics |>
  filter(season == 2025)
head(boston_celtics)
# A tibble: 6 × 57
game_id season season_type game_date game_date_time team_id team_uid
<dbl> <dbl> <dbl> <date> <dttm> <dbl> <chr>
1 401703390 2025 2 2024-11-19 2024-11-20 00:00:00 2 s:40~l:46…
2 401704796 2025 2 2024-11-16 2024-11-17 01:00:00 2 s:40~l:46…
3 401704784 2025 2 2024-11-13 2024-11-14 00:30:00 2 s:40~l:46…
4 401703370 2025 2 2024-11-12 2024-11-13 00:00:00 2 s:40~l:46…
5 401704768 2025 2 2024-11-10 2024-11-10 20:30:00 2 s:40~l:46…
6 401704753 2025 2 2024-11-08 2024-11-09 00:30:00 2 s:40~l:46…
# ℹ 50 more variables: team_slug <chr>, team_location <chr>, team_name <chr>,
# team_abbreviation <chr>, team_display_name <chr>,
# team_short_display_name <chr>, team_color <chr>,
# team_alternate_color <chr>, team_logo <chr>, team_home_away <chr>,
# team_score <dbl>, team_winner <lgl>, assists <dbl>, blocks <dbl>,
# defensive_rebounds <dbl>, fast_break_points <dbl>, field_goal_pct <dbl>,
# field_goals_made <dbl>, field_goals_attempted <dbl>, …
Checklist
Remember you need to tell R where to look for this file by setting your working directory
Setting up an R Project helps with this!
Check your current working directory with getwd()
Check your file is in this directory with list.files()
After you read in your data, look at it using View() or head()
Make sure it looks like what you expect
Also check what type the computer thinks your data is: use str(), class() or summary()
If all that seems good - we can add it to our plot!
It’s still an empty plot because we haven’t told R what to do with the data yet.
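A small sketch of the data layer on its own. The toy_games data frame below is made up for illustration (it is not real Celtics data); with the lecture data you would pass boston_celtics instead.

```r
library(ggplot2)

# Made-up scores for illustration only; these are NOT real Celtics results
toy_games <- data.frame(
  game_date  = as.Date("2024-11-01") + c(0, 2, 5, 7),
  team_score = c(112, 105, 126, 118)
)

# The data layer: ggplot now knows about the data, but nothing is drawn yet
data_only_plot <- ggplot(data = toy_games)
data_only_plot  # prints a blank panel
```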
geom
Type ?geom_ in your Console and hit tab to scroll through a list of all the different plot geometries
Think of all these types as being like the Visualisation Pane in Power BI
Our turn
Let’s create a bar plot showing the team_score each game this season
Use the geometry layer - geom_col
Similar to geom_bar (but does slightly different things)
Add the geom
This is what your code should look like when you add your geom layer
Warning
But … this code won’t work yet - we haven’t added our aesthetic layer
The aesthetic layer defines how data is mapped to visual properties in your plot
Common Aesthetic Mappings
Use the aes() function to map variables to aesthetics.
The common parts are:
x: The variable on the x-axis.
y: The variable on the y-axis.
color: The color of points, lines, or outlines.
fill: The fill color for bars, areas, or shapes.
size: The size of points or lines.
shape: The shape of points (e.g., circles, triangles).
alpha: The transparency level.
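A hedged sketch of an aesthetic mapping in action: the bar plot of score per game, using a made-up toy_games data frame in place of the lecture data. The column names mirror those in boston_celtics, but the values are invented.

```r
library(ggplot2)

# Made-up scores for illustration only; these are NOT real Celtics results
toy_games <- data.frame(
  game_date   = as.Date("2024-11-01") + c(0, 2, 5, 7),
  team_score  = c(112, 105, 126, 118),
  team_winner = c(TRUE, FALSE, TRUE, TRUE)
)

# aes() maps columns to visual properties: date -> x position, score -> bar height
score_bars <- ggplot(data = toy_games) +
  geom_col(aes(x = game_date, y = team_score))
score_bars
```

Adding fill = team_winner inside aes() would colour each bar by whether the game was won.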
Important
When using only one data source you can put the aes() mapping as input into ggplot()
For multiple data sources - the data and aesthetics mapping should go in the geom layer directly.
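The two placements can be sketched as follows. Both data frames are made up for illustration, and toy_average is a hypothetical second data source holding the mean score.

```r
library(ggplot2)

# Two made-up data sources (illustrative values only)
toy_games <- data.frame(
  game_date  = as.Date("2024-11-01") + c(0, 2, 5, 7),
  team_score = c(112, 105, 126, 118)
)
toy_average <- data.frame(average_score = mean(toy_games$team_score))

# One data source: data and aes() can both go inside ggplot()
single_source_plot <- ggplot(data = toy_games,
                             aes(x = game_date, y = team_score)) +
  geom_col()

# Two data sources: each geom is given its own data and aes()
two_source_plot <- ggplot() +
  geom_col(data = toy_games, aes(x = game_date, y = team_score)) +
  geom_hline(data = toy_average, aes(yintercept = average_score),
             linetype = "dashed")
```

Anything set inside ggplot() is inherited by every layer, which is convenient for one data source and confusing for several.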
Colours
There are 657 named colours in R e.g. “red”, “blue”, “yellow”
To see them all run colors()
Colours can also be specified using hex codes, e.g. "forestgreen" is "#228B22".
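A quick check of both facts in base R. The conversion via col2rgb() and rgb() is just one way to look up a hex code.

```r
# All named colours known to R
length(colors())

# Convert a named colour to its hex code via its red/green/blue values
forestgreen_rgb <- col2rgb("forestgreen")
forestgreen_hex <- rgb(forestgreen_rgb[1], forestgreen_rgb[2], forestgreen_rgb[3],
                       maxColorValue = 255)
forestgreen_hex
```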
Colour vs Fill
Let’s set the bar colour to Green
Colour vs Fill
Let’s set the bar fill to Green
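A minimal sketch of the difference, using made-up data rather than the lecture data set. Because green is a fixed value here and not a variable, it is set outside aes().

```r
library(ggplot2)

# Made-up scores for illustration only; these are NOT real Celtics results
toy_games <- data.frame(
  game_date  = as.Date("2024-11-01") + c(0, 2, 5, 7),
  team_score = c(112, 105, 126, 118)
)

# colour controls the outline of each bar
outline_plot <- ggplot(data = toy_games) +
  geom_col(aes(x = game_date, y = team_score), colour = "green")

# fill controls the inside of each bar
fill_plot <- ggplot(data = toy_games) +
  geom_col(aes(x = game_date, y = team_score), fill = "green")
```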
Next Layer
Now we are happy with our:
we can work on the scale layer, e.g. axis limits and colour scales
Scale mapping
To map a variable to the colour or fill, we need to specify the mapping in aes().
There are many inbuilt palettes, see RColorBrewer
You can also change fill/colour scales manually using scale_colour_manual or scale_fill_manual.
Colour Scales
IMO: Colour scales are one of the hardest parts of learning ggplot2
To change colour scale, use scale_colour_*
To change fill scale, use scale_fill_*
Check out all the different types of scales using the help menu ?scale_colour and hit tab.
Note for discrete variables needing distinct colours, such as categorical variables, you can use scale_*_brewer
For continuous variables needing a smooth gradient use scale_*_distiller
You can also set colours manually using scale_*_manual
Note * here is like a blank space and it means there are multiple things that could be inserted here
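The three kinds of scale can be sketched side by side. The data are made up for illustration, and the palette names ("Set2", "Greens") are just example choices from ColorBrewer.

```r
library(ggplot2)

# Made-up scores for illustration only; these are NOT real Celtics results
toy_games <- data.frame(
  game_date   = as.Date("2024-11-01") + c(0, 2, 5, 7),
  team_score  = c(112, 105, 126, 118),
  team_winner = c(TRUE, FALSE, TRUE, TRUE)
)

# Fill mapped to a discrete (win/loss) variable
base_plot <- ggplot(data = toy_games) +
  geom_col(aes(x = game_date, y = team_score, fill = team_winner))

# Discrete variable, ColorBrewer palette
brewer_plot <- base_plot + scale_fill_brewer(palette = "Set2")

# Discrete variable, colours chosen by hand
manual_plot <- base_plot +
  scale_fill_manual(values = c("TRUE" = "forestgreen", "FALSE" = "lightgreen"))

# Continuous variable, smooth gradient
gradient_plot <- ggplot(data = toy_games) +
  geom_col(aes(x = game_date, y = team_score, fill = team_score)) +
  scale_fill_distiller(palette = "Greens", direction = 1)
```

Swapping fill for colour in both the aes() mapping and the scale name gives the equivalent scale_colour_* versions.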
Themes
Here is a list of the themes.
My favourite is theme_bw().
Grey Background
The default background for ggplot2 is arguably chartjunk.
But - There are good reasons for using it.
“We can still see the gridlines to aid in the judgement of position (Cleveland, 1993b), but they have little visual impact and we can easily “tune” them out… Finally, the grey background creates a continuous field of colour which ensures that the plot is perceived as a single visual entity.”
Wickham on the grey background. Source: ggplot2: Elegant Graphics for Data Analysis.
Here I change the theme background to theme_bw().
Plot Theme Specifics
To tune the more specific aspects of your theme, we use the theme() layer.
Look up ?theme - there are a lot of options!
To improve data-density we can move the legend to the bottom and remove the legend label.
The theme() layer is also where you can add specifics about titles, text and axes.
celtics_plot <- ggplot(data = boston_celtics) +
  geom_col(aes(x = game_date, y = team_score, fill = team_winner), col = "gray") +
  # Legend entries are ordered FALSE then TRUE, so the labels must be Loss then Win
  scale_fill_manual(labels = c("Loss", "Win"), values = c("TRUE" = "forestgreen", "FALSE" = "lightgreen")) +
  theme_bw() +
  theme(
    legend.position = "bottom",
    legend.title = element_blank(),
    plot.title = element_text(size = 20),
    axis.title.x = element_text(size = 15),
    axis.title.y = element_text(size = 15),
    axis.text = element_text(size = 12),
    legend.text = element_text(size = 12)) +
  xlab("Game Date") +
  ylab("Team Score") +
  ggtitle("Boston Celtics 2025 Season so far")
Your turn
Recreate this plot - But experiment with the different parts to see how they work!
Change the fixed fill and the colour
Change the colour/fill palette
Experiment with different ranges of the y-axes (x is tricky so don’t worry about that for now.)
Change the titles and axes labels
Change the theme
Do these one at a time so you understand what each piece of code does!
Distribution Plots
Instead of comparing the points scored per game, I might like to look at the distribution of points scored in a game.
Too many bins:
Too few bins:
A better number of bins:
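The effect of the bin count can be sketched with simulated scores. The data below are randomly generated for illustration only, and the bin counts (100, 3 and 15) are example choices, not the values used in the lecture plots.

```r
library(ggplot2)

# Simulated scores for illustration only (not real game data)
set.seed(2250)
toy_scores <- data.frame(team_score = round(rnorm(200, mean = 115, sd = 10)))

# Too many bins: noisy, with lots of gaps
too_many_bins <- ggplot(data = toy_scores, aes(x = team_score)) +
  geom_histogram(bins = 100)

# Too few bins: the shape of the distribution is hidden
too_few_bins <- ggplot(data = toy_scores, aes(x = team_score)) +
  geom_histogram(bins = 3)

# A more moderate choice shows the overall shape
better_bins <- ggplot(data = toy_scores, aes(x = team_score)) +
  geom_histogram(bins = 15)
```

There is no single correct number of bins, so it is worth trying a few values (or setting binwidth instead) and checking the story stays the same.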
When to Use
Watch out for
When to use
Watch out for
When to Use
Watch out for
Outliers are not necessarily errors! Leave them in, unless you have a good reason.
R automatically plots points as outliers if they lie more than 1.5 times the interquartile range below the first quartile or above the third quartile
A boxplot is a summary. If the shape of the underlying distribution is important, try geom_violin()
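A small sketch of the 1.5 times IQR rule worked by hand, followed by the boxplot and violin versions of the same data. The scores are made up for illustration, with one deliberately extreme game.

```r
library(ggplot2)

# Made-up scores with one unusually high game (illustrative values only)
toy_scores <- data.frame(
  team_score = c(98, 102, 105, 108, 110, 112, 115, 118, 121, 170)
)

# The 1.5 x IQR rule, computed by hand
quartiles   <- unname(quantile(toy_scores$team_score, c(0.25, 0.75)))
iqr         <- IQR(toy_scores$team_score)
lower_fence <- quartiles[1] - 1.5 * iqr
upper_fence <- quartiles[2] + 1.5 * iqr
flagged <- toy_scores$team_score[toy_scores$team_score < lower_fence |
                                   toy_scores$team_score > upper_fence]
flagged  # the 170-point game is the only value outside the fences

# Boxplot: a summary, with the flagged value drawn as a point
box_plot <- ggplot(data = toy_scores, aes(x = "", y = team_score)) +
  geom_boxplot()

# Violin plot: shows the shape of the distribution instead
violin_plot <- ggplot(data = toy_scores, aes(x = "", y = team_score)) +
  geom_violin()
```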
Choosing the Right Plot
| Plot Type | Best For | Features |
|---|---|---|
| Histogram | Raw frequency counts | Splits the data into bins |
| Density Plot | Smoothed Distribution | Continuous, no bins |
| Box plot | Showing the 5 number Summary | Shows outliers as points |
Key Takeaways
Pick the appropriate plot for your data and the relationship you want to show
And sometimes the best plot combines more than one geometry together!
boston_celtics = read_csv("data/boston_celtics.csv")
ggplot(data = boston_celtics, aes(x = team_score)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)),
                 alpha = 0.4,
                 col = "lightgray") +
  # linewidth replaces size for lines (ggplot2 3.4.0 and later)
  geom_density(col = "red",
               adjust = 1.25,
               linewidth = 1.1) +
  geom_vline(aes(xintercept = mean(team_score)),
             col = "red",
             linetype = "dotted",
             linewidth = 1.1) +
  theme_bw() +
  ggtitle("Boston Celtics Team Score 2021 - Present")
Plots showing correlation and relationships
Scatter plots (geom_point)
Bubble plots (size)
Line plots (geom_line)
Heatmaps (geom_hex, geom_bin2d)
2D density plots (geom_density_2d)
And their variations!
Scatter Plot
Good for looking at relationships between two numeric variables
field_goals_scatter_plot <- ggplot(data = all_years_boston_celtics) +
  geom_point(aes(x = field_goals_attempted, y = field_goals_made)) +
  theme_bw() +
  xlab("Field Goals Attempted (2 pts)") +
  ylab("Field Goals Made (2 pts)") +
  theme(legend.position = "bottom") +
  ggtitle("Boston Celtics Field Goals 2021 - Present")
Say I want to show the total point score as well or which games they won?
Back to preattentive processing
field_goals_colour_plot <-
  all_years_boston_celtics |>
  ggplot() +
  geom_point(aes(x = field_goals_attempted, y = field_goals_made, col = team_score), alpha = 0.4) +
  scale_colour_distiller(palette = "Greens") +
  theme_bw() +
  xlab("Field Goals Attempted (2 pts)") +
  ylab("Field Goals Made (2 pts)") +
  theme(legend.position = "bottom") +
  ggtitle("Boston Celtics Field Goals 2021 - Present") +
  coord_fixed()
field_goals_bubble_plot <-
  all_years_boston_celtics |>
  # filter(season == 2025) |>
  ggplot() +
  geom_point(aes(x = field_goals_attempted, y = field_goals_made, size = team_score), alpha = 0.4) +
  scale_size_continuous(
    name = "Team Score",
    breaks = c(110, 115, 120, 125, 130)
  ) +
  theme_bw() +
  xlab("Field Goals Attempted (2 pts)") +
  ylab("Field Goals Made (2 pts)") +
  theme(legend.position = "bottom") +
  ggtitle("Boston Celtics Field Goals 2021 - Present") +
  coord_fixed()
Watch out for
In both scatter and bubble plots overplotting can occur. Changing transparency can help, but sometimes there are too many points!
Solution
In these instances a heatmap can help.
Heatmaps are like a histogram, but the data is binned along both the x and y axes.
field_goals_heatmap_all <-
  ggplot(data = all_years_boston_celtics) +
  geom_hex(
    aes(
      x = field_goals_attempted,
      y = field_goals_made
    )) +
  scale_fill_distiller(palette = "Greens", direction = 1, name = "Count") +
  geom_density_2d(
    aes(
      x = field_goals_attempted,
      y = field_goals_made
    ), alpha = 0.4, adjust = 1.2, col = "forestgreen") +
  theme_bw() +
  xlab("Field Goals Attempted (2 pts)") +
  ylab("Field Goals Made (2 pts)") +
  ggtitle("Boston Celtics Field Goals 2021 - 2024") +
  coord_fixed()
Summary
Learnt about plotting in R.
This included how to use the ggplot2 package in R, and
The grammar of graphics
Learnt ggplot2 is like Shrek!
Learnt how to create a range of plots in R
Remember the tips for learning R from Lecture 1 if you get stuck!
ETX2250/ETF5922