ETX2250/ETF5922

Data Wrangling for Plotting

Lecturer: Kate Saunders

Department of Econometrics and Business Statistics


  • etx2250-etf5922.caulfield-x@monash.edu
  • Lecture 5
  • <a href="dvac.ss.numbat.space">dvac.ss.numbat.space</a>


Real World

Data often isn’t given to us in the form we want!!!

Getting Data

  • Often the variables we want to plot aren’t in the data set

  • We may need to create variables we want ourselves

  • Or get rid of variables we don’t need

  • Or create new summaries of our data for plotting

  • We might also need to restructure our data entirely to plot it

Today’s Lecture

Learning Objectives

  • Learn more about reading data in R so our data types are correct

  • Learn more about how we’d like to give and receive data

    • Codd’s Model / Tidy Data
  • Learn the basics of transforming data

    • filtering, selecting, and mutating (making new variables)
  • Learn how to create new data sets from existing data by:

    • grouping, summarising, reshaping, and joining

Importing Data

Data import

Overview

  • Data often stored in a single table

  • Columns are separated by a comma or a tab

  • The readr package is very useful for reading files into R as data frame objects

  • The readr package is part of the tidyverse packages

Data frame

Data frame

  • A data frame is an object with:
    • Each row corresponding to an observation or case,
    • Each column corresponding to a variable.
  • Two types of data frames in R are:
    • The data.frame in base R
    • The tibble in the tidyverse

The readr package

Read functions

  • Two of the more useful functions are read_csv and read_tsv for comma delimited and tab delimited files, respectively

  • Generally the defaults for these functions work quite well.

  • One argument that is worth discussing a little is col_types

Code             Description
col_logical()    Logical (TRUE, FALSE, or NA)
col_integer()    Integer
col_double()     Double (numeric with decimals)
col_character()  Character
col_factor()     Categorical variable as a factor
col_date()       Date (YYYY-MM-DD)
col_datetime()   Date-time (POSIXct)
col_time()       Time (HH:MM:SS)
col_skip()       Skip this column entirely
col_guess()      Let readr guess the column type
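A minimal sketch of passing these to read_csv, using a throwaway CSV written to a temporary file so the example is self-contained (the column names here are made up for illustration):

```r
library(readr)

# Write a tiny CSV to a temporary file purely for illustration
path <- tempfile(fileext = ".csv")
writeLines(c("game_date,points,team",
             "2021-06-01,104,BOS",
             "2021-05-30,97,BOS"), path)

games <- read_csv(path, col_types = cols(
  game_date = col_date(),      # parsed as Date, not character
  points    = col_integer(),
  team      = col_character()
))
```

Here cols() pairs each column name with one of the col_*() functions from the table above.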

Characters or Dates?

Note

  • A common problem is dates reading in as a character string

  • This might have happened to you: here is how to fix it

library(tidyverse)
boston_celtics <- read_csv("data/boston_celtics.csv", 
           col_types = cols(game_date = col_date()))
str(boston_celtics$game_date)
 Date[1:401], format: "2021-06-01" "2021-05-30" "2021-05-28" "2021-05-25" "2021-05-22" ...

Reading in long numbers

Long numbers

  • The largest integer that many computers will accurately store is about 19 digits long, and R's doubles are only exact to about 15-16 significant digits.

  • Some identification numbers are longer than that (for instance, IBAN numbers used for bank transfers).

  • In some cases, R may try to read these numbers in as integers or doubles, silently losing digits.

  • To avoid errors, these should be read in as characters.

# Example - code will not run
read_csv("data/bank_data.csv", 
           col_types = cols(
             IBAN = col_character()
           ))
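The precision loss is easy to demonstrate in base R without any file: a 20-digit identifier survives as a character string but not as a number (the digits below are made up):

```r
id_string <- "12345678901234567890"  # a made-up 20-digit identifier

nchar(id_string)          # 20 characters, preserved exactly

# Coerced to a double, only about 15-16 significant digits survive,
# so the value printed back is no longer the same identifier
id_numeric <- as.numeric(id_string)
format(id_numeric, scientific = FALSE)
```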

Column types

Be careful

  • Specifying col_types = cols(.default = "c") overrides the default behaviour of readr.

  • You can use this to force everything to be read in as a character.

  • If needed these can subsequently be converted to numeric variables or other variable types.
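A sketch of this pattern, again with a hypothetical temporary file; note how the leading zero in id survives as a character but would be lost as a number:

```r
library(readr)

path <- tempfile(fileext = ".csv")
writeLines(c("id,amount",
             "001,10.5",
             "002,3.2"), path)

# .default = "c" forces every column to be read as character
raw <- read_csv(path, col_types = cols(.default = "c"))
raw$id       # leading zeros preserved: "001" "002"

# Convert only the columns that genuinely are numeric
raw$amount <- as.numeric(raw$amount)
```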

Tidy Data

Example

Suppose we have the following database:

# A tibble: 3 × 3
  Name  DoB        Email                   
  <chr> <chr>      <chr>                   
1 Ahmed 1994/03/01 ahmed@personal.com      
2 Bin   1954/12/23 bin@me.com; bin@work.com
3 Carol 1982/07/16 carol@mailcom           

Problems

Problems

  • Carol’s entry is not a valid email.
    • A function can be written to check and remove this.
  • A bigger problem is that the email entry is not atomic. There are two emails for Bin.
    • If code is written to check if the email is valid, then this code will also fail for Bin.
  • On the next slide is a solution.

Possible solution

# A tibble: 3 × 4
  Name  DoB        Email1             Email2      
  <chr> <chr>      <chr>              <chr>       
1 Ahmed 1994/03/01 ahmed@personal.com <NA>        
2 Bin   1954/12/23 bin@me.com         bin@work.com
3 Carol 1982/07/16 carol@mailcom      <NA>        

Problems

Still problems

  • This solution forces an ordering between Email1 and Email2 that may be arbitrary.

  • Also, a new entry into the database may be:

    • Name: Deepal
    • DoB: 1987/04/23
    • Email: deepal@work.com; deepal@me.com; coolgirl87@me.com
  • The entire database needs to be changed to allow for three email addresses.

Solution

# A tibble: 4 × 3
  Name  DoB        Email             
  <chr> <chr>      <chr>             
1 Ahmed 1994/03/01 ahmed@personal.com
2 Bin   1954/12/23 bin@me.com        
3 Bin   1954/12/23 bin@work.com      
4 Carol 1982/07/16 carol@mailcom     

Deepal can be added as

# A tibble: 3 × 3
  Name   DoB        Email            
  <chr>  <chr>      <chr>            
1 Deepal 1987/04/23 deepal@work.com  
2 Deepal 1987/04/23 deepal@me.com    
3 Deepal 1987/04/23 coolgirl87@me.com

First normal form (1NF)

We want data in an easily workable and robust form!

Databases

  • 1NF is an important concept in database normalisation

  • No repeating groups

  • Every field contains atomic (indivisible) values

  • Each record is unique

Data analysis

1NF is closely related to Tidy data principles.

Tidy Data Principles

  • Each variable must have its own column
  • Each observation must have its own row
  • Each value must have its own cell

Codd’s model

dplyr R package for data wrangling

  • Many of the functions in dplyr are similar to operations in SQL (a database query language)

  • SQL is itself built upon (but is not exactly the same as) a theoretical model known as Codd's model

  • An important aspect of Codd’s model is the first normal form

  • Codd’s model provides guidelines for good data management

Why you care: TLDR

  • We map variables to our aesthetic layer for plotting

  • If our data is in a tidy form then it's easy for us to do that mapping

  • If not, we'll use dplyr tools to format our data

Selecting and Filtering

dplyr R package

Again

  • Data rarely comes in the format we want

A few simple functions

  • Much can be done with a few simple functions from the dplyr package:

    • Choose column variables with select
    • Choose rows of observations with filter
    • Transform and create new variables with mutate
  • In all cases both input and output is a data frame.

  • Check out the dplyr cheat sheet

Let’s look at an example

The diamonds data is a tibble included in R

diamonds
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Select variables

Note

  • The select function can be used if we only want to focus on a subset of variables

  • Check out ?select for all the different ways to select columns

e.g. For the diamonds data set we may only be interested in carat, cut and price.

diamonds |> select(carat, cut, price)
# A tibble: 53,940 × 3
   carat cut       price
   <dbl> <ord>     <int>
 1  0.23 Ideal       326
 2  0.21 Premium     326
 3  0.23 Good        327
 4  0.29 Premium     334
 5  0.31 Good        335
 6  0.24 Very Good   336
 7  0.24 Very Good   336
 8  0.26 Very Good   337
 9  0.22 Fair        337
10  0.23 Very Good   338
# ℹ 53,930 more rows

Select variables

To drop variables, use a - sign.

diamonds |> select(-c(carat, cut, price))
# A tibble: 53,940 × 7
   color clarity depth table     x     y     z
   <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
 1 E     SI2      61.5    55  3.95  3.98  2.43
 2 E     SI1      59.8    61  3.89  3.84  2.31
 3 E     VS1      56.9    65  4.05  4.07  2.31
 4 I     VS2      62.4    58  4.2   4.23  2.63
 5 J     SI2      63.3    58  4.34  4.35  2.75
 6 J     VVS2     62.8    57  3.94  3.96  2.48
 7 I     VVS1     62.3    57  3.95  3.98  2.47
 8 H     SI1      61.9    55  4.07  4.11  2.53
 9 E     VS2      65.1    61  3.87  3.78  2.49
10 H     VS1      59.4    61  4     4.05  2.39
# ℹ 53,930 more rows

Filter observations

Suppose we only want to consider diamonds worth more than $1000

diamonds |> filter(price > 1000)
# A tibble: 39,416 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.7  Ideal     E     SI1      62.5    57  2757  5.7   5.72  3.57
 2  0.86 Fair      E     SI2      55.1    69  2757  6.45  6.33  3.52
 3  0.7  Ideal     G     VS2      61.6    56  2757  5.7   5.67  3.5 
 4  0.71 Very Good E     VS2      62.4    57  2759  5.68  5.73  3.56
 5  0.78 Very Good G     SI2      63.8    56  2759  5.81  5.85  3.72
 6  0.7  Good      E     VS2      57.5    58  2759  5.85  5.9   3.38
 7  0.7  Good      F     VS1      59.4    62  2759  5.71  5.76  3.4 
 8  0.96 Fair      F     SI2      66.3    62  2759  6.27  5.95  4.07
 9  0.73 Very Good E     SI1      61.6    59  2760  5.77  5.78  3.56
10  0.8  Premium   H     SI1      61.5    58  2760  5.97  5.93  3.66
# ℹ 39,406 more rows

Logical statements

Examples

  • The term price > 1000 is an example of a logical statement. It can be true or false.

  • Other examples are

    • Price less than 1000: price < 1000
    • Price less than or equal to 1000: price <= 1000
    • Price not equal to 1000: price != 1000
    • Price exactly 1000: price == 1000
  • Note the two equal signs for testing equality.

  • In general ! is the negation of a statement.
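Each of these comparisons returns a logical vector, which filter then uses to keep the TRUE rows; a quick base-R illustration:

```r
price <- c(800, 1000, 1500)

price > 1000     # FALSE FALSE  TRUE
price <= 1000    #  TRUE  TRUE FALSE
price == 1000    # FALSE  TRUE FALSE  (two equal signs for comparison)
price != 1000    #  TRUE FALSE  TRUE
!(price > 1000)  #  TRUE  TRUE FALSE  (! negates the statement)
```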

AND operator

  • If two statements need to be satisfied use &.
diamonds |> filter(price > 1000 & cut == 'Ideal')
# A tibble: 14,700 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.7  Ideal E     SI1      62.5    57  2757  5.7   5.72  3.57
 2  0.7  Ideal G     VS2      61.6    56  2757  5.7   5.67  3.5 
 3  0.74 Ideal G     SI1      61.6    55  2760  5.8   5.85  3.59
 4  0.8  Ideal I     VS1      62.9    56  2760  5.94  5.87  3.72
 5  0.75 Ideal G     SI1      62.2    55  2760  5.87  5.8   3.63
 6  0.74 Ideal I     VVS2     62.3    55  2761  5.77  5.81  3.61
 7  0.81 Ideal F     SI2      58.8    57  2761  6.14  6.11  3.6 
 8  0.59 Ideal E     VVS2     62      55  2761  5.38  5.43  3.35
 9  0.8  Ideal F     SI2      61.4    57  2761  5.96  6     3.67
10  0.74 Ideal E     SI2      62.2    56  2761  5.8   5.84  3.62
# ℹ 14,690 more rows

OR operator

  • If either one or the other statement needs to be satisfied use |.
diamonds |> filter(cut == 'Ideal' | color == 'E')
# A tibble: 27,445 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
 5  0.23 Ideal     J     VS1      62.8    56   340  3.93  3.9   2.46
 6  0.31 Ideal     J     SI2      62.2    54   344  4.35  4.37  2.71
 7  0.2  Premium   E     SI2      60.2    62   345  3.79  3.75  2.27
 8  0.32 Premium   E     I1       60.9    58   345  4.38  4.42  2.68
 9  0.3  Ideal     I     SI2      62      54   348  4.31  4.34  2.68
10  0.23 Very Good E     VS2      63.8    55   352  3.85  3.92  2.48
# ℹ 27,435 more rows

IN operator

  • Another useful operator is %in%.
diamonds |> filter(cut %in% c('Ideal', 'Fair'))
# A tibble: 23,161 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 3  0.23 Ideal J     VS1      62.8    56   340  3.93  3.9   2.46
 4  0.31 Ideal J     SI2      62.2    54   344  4.35  4.37  2.71
 5  0.3  Ideal I     SI2      62      54   348  4.31  4.34  2.68
 6  0.33 Ideal I     SI2      61.8    55   403  4.49  4.51  2.78
 7  0.33 Ideal I     SI2      61.2    56   403  4.49  4.5   2.75
 8  0.33 Ideal J     SI1      61.1    56   403  4.49  4.55  2.76
 9  0.23 Ideal G     VS1      61.9    54   404  3.93  3.95  2.44
10  0.32 Ideal I     SI1      60.9    55   404  4.45  4.48  2.72
# ℹ 23,151 more rows

NOT operator

  • Use ! as not.
diamonds |> filter(!(cut %in% c('Ideal', 'Fair')))
# A tibble: 30,779 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 2  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 3  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 4  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 5  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 6  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 7  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 8  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
 9  0.3  Good      J     SI1      64      55   339  4.25  4.28  2.73
10  0.22 Premium   F     SI1      60.4    61   342  3.88  3.84  2.33
# ℹ 30,769 more rows

Your turn

Your turn

  • Write R code for the following:
    • price greater than 2000 and carat less than 3.
    • price greater than 2000 and cut either Very Good, Premium or Ideal.
    • carat less than 2 or price greater than 500.
    • carat less than 2 and cut not equal to Premium.

Creating New Variables

Let’s look at an example

The mpg (miles per gallon) data is a tibble included in R

mpg
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows

Mutate

Example

  • Suppose we are interested in fuel efficiency in the city or on the highway
  • We’ll need the variables cty and hwy
  • These are measured in miles per gallon, but we want to convert into metric units
  • One mile per gallon equals 0.425144 km per litre.
mpg_subset <- mpg |> select(cty, hwy)
mpg_subset <- mpg_subset |> mutate( 
                     cty_metric = 0.425144 * cty,
                     hwy_metric = 0.425144 * hwy)

Result

# A tibble: 234 × 4
     cty   hwy cty_metric hwy_metric
   <int> <int>      <dbl>      <dbl>
 1    18    29       7.65       12.3
 2    21    29       8.93       12.3
 3    20    31       8.50       13.2
 4    21    30       8.93       12.8
 5    16    26       6.80       11.1
 6    18    26       7.65       11.1
 7    18    27       7.65       11.5
 8    18    26       7.65       11.1
 9    16    25       6.80       10.6
10    20    28       8.50       11.9
# ℹ 224 more rows

The pipe operator

Super useful

  • In dplyr, the input and output are always data frames.

  • That means we can chain commands together using |> (or the older %>%)

out <- c(1, 12, 4) |>
  mean() |>
  sqrt()

str(out)
 num 2.38

Pipes and dplyr

This code

mpg_subset <- mpg |> select(cty, hwy)
mpg_subset <- mpg_subset |> mutate( 
                     cty_metric = 0.425144 * cty,
                     hwy_metric = 0.425144 * hwy)

does the same as this code

mpg_subset <- mpg |> 
  select(cty, hwy) |> 
  mutate(cty_metric = 0.425144 * cty,
         hwy_metric = 0.425144 * hwy)

and it’s a lot more readable!

Your turn

Your turn

Start with mpg and practice using the |> operator:

  • Create the new variables for cty_metric and hwy_metric yourself

  • Add another new variable called efficient - cars are efficient if the hwy_metric < 6

  • Then create a data set of manufacturer and model that contains only the efficient cars

  • How many efficient cars are there? (Hint: look up ?distinct() and ?nrow())

Grouping

Let’s look at an example

Data

  • The txhousing data is a tibble included in R

  • Contains monthly sales data for each city from 2000 to 2015

  • Suppose the aim is to find the total number of sales across all cities in a year

txhousing
# A tibble: 8,602 × 9
   city     year month sales   volume median listings inventory  date
   <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
 1 Abilene  2000     1    72  5380000  71400      701       6.3 2000 
 2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.
 3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.
 4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.
 5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.
 6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.
 7 Abilene  2000     7   152 12635000  73500      742       6.2 2000.
 8 Abilene  2000     8   131 10710000  75000      765       6.4 2001.
 9 Abilene  2000     9   104  7615000  64500      771       6.5 2001.
10 Abilene  2000    10   101  7040000  59300      764       6.6 2001.
# ℹ 8,592 more rows

Group by function

group_by

  • group_by allows us to perform functions on groups within our data

  • e.g. across a year, or the levels of a categorical variable

  • When used in combination with the summarise function, it can construct new data sets

  • e.g. get a total, mean, or sd for each group

Group by and summarise

total_sales <- txhousing |>
  group_by(year) |>
  summarise(total_sales = sum(sales, na.rm = TRUE)) 
head(total_sales, n = 5)
# A tibble: 5 × 2
   year total_sales
  <int>       <dbl>
1  2000      222483
2  2001      231453
3  2002      234600
4  2003      253909
5  2004      283999

Note

  • Data for some cities is unavailable and is filled in as NA (early years in particular)

  • na.rm = TRUE removes missing data before taking the sum
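The effect of na.rm is easy to see on a small vector:

```r
sales <- c(72, 98, NA, 130)

sum(sales)                # NA: a single missing value makes the whole sum NA
sum(sales, na.rm = TRUE)  # 300: missing values are dropped before summing
```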

group_by two variables

Two variables

  • Suppose we want to get the yearly total for each city

  • Need to group by city and year.

group_by and summarise

total_sales <- txhousing |>
  group_by(year, city) |>
  summarise(total_sales = sum(sales, na.rm = TRUE))
total_sales
# A tibble: 736 × 3
# Groups:   year [16]
    year city                  total_sales
   <int> <chr>                       <dbl>
 1  2000 Abilene                      1375
 2  2000 Amarillo                     2076
 3  2000 Arlington                    4969
 4  2000 Austin                      18621
 5  2000 Bay Area                     4884
 6  2000 Beaumont                     1781
 7  2000 Brazoria County              1060
 8  2000 Brownsville                   486
 9  2000 Bryan-College Station        1356
10  2000 Collin County               10000
# ℹ 726 more rows

Your turn

Your turn

  • In the diamonds data set, find the average price for each cut of diamond

  • In the diamonds data set, find the average price for each cut of diamond given that price is above 2000.

  • In the mpg data set find the average fuel efficiency in the city for each year.

  • In the mpg data set, consider Toyota, Nissan and Honda. Find the value of hwy for each of these manufacturer’s most fuel efficient car on the highway.

Reshaping

Long vs. wide format

Note

  • Generally, attempt to have one column per variable.

  • Sometimes the meaning of a variable is ambiguous

  • In many cases it is easier to use ggplot2 if the data are reshaped into a different form.

  • We investigate using the economics data.

Economics: wide

economics
# A tibble: 574 × 6
   date         pce    pop psavert uempmed unemploy
   <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
 1 1967-07-01  507. 198712    12.6     4.5     2944
 2 1967-08-01  510. 198911    12.6     4.7     2945
 3 1967-09-01  516. 199113    11.9     4.6     2958
 4 1967-10-01  512. 199311    12.9     4.9     3143
 5 1967-11-01  517. 199498    12.8     4.7     3066
 6 1967-12-01  525. 199657    11.8     4.8     3018
 7 1968-01-01  531. 199808    11.7     5.1     2878
 8 1968-02-01  534. 199920    12.3     4.5     3001
 9 1968-03-01  544. 200056    11.7     4.1     2877
10 1968-04-01  544  200208    12.3     4.6     2709
# ℹ 564 more rows
  • It has a column for each variable and a row for each observation/case.

Economics: long

economics_long
# A tibble: 2,870 × 4
   date       variable value  value01
   <date>     <chr>    <dbl>    <dbl>
 1 1967-07-01 pce       507. 0       
 2 1967-08-01 pce       510. 0.000265
 3 1967-09-01 pce       516. 0.000762
 4 1967-10-01 pce       512. 0.000471
 5 1967-11-01 pce       517. 0.000916
 6 1967-12-01 pce       525. 0.00157 
 7 1968-01-01 pce       531. 0.00207 
 8 1968-02-01 pce       534. 0.00230 
 9 1968-03-01 pce       544. 0.00322 
10 1968-04-01 pce       544  0.00319 
# ℹ 2,860 more rows
  • It has one column indicating the type of variable contained in that row and then a separate column for the value for that variable.

  • Both wide and long formats store the same information.

Use case of wide format

ggplot(economics, aes(x = uempmed, y = psavert)) +
  geom_point()

Use case of long format

ggplot(economics_long, aes(x = date, y = value)) +
  geom_line() +
  facet_wrap(vars(variable), scales = 'free_y')

We will learn about facet_wrap next workshop.

From wide to long

long_data <- economics |> 
  pivot_longer(
    cols = -date,
    names_to = 'Variable',
    values_to = 'Value')
long_data
# A tibble: 2,870 × 3
   date       Variable    Value
   <date>     <chr>       <dbl>
 1 1967-07-01 pce         507. 
 2 1967-07-01 pop      198712  
 3 1967-07-01 psavert      12.6
 4 1967-07-01 uempmed       4.5
 5 1967-07-01 unemploy   2944  
 6 1967-08-01 pce         510. 
 7 1967-08-01 pop      198911  
 8 1967-08-01 psavert      12.6
 9 1967-08-01 uempmed       4.7
10 1967-08-01 unemploy   2945  
# ℹ 2,860 more rows

Pivot longer

Arguments

  • The names_to will be a variable of column names.

  • The values_to will be a variable that stores the body of the data frame.

  • Use - in the cols argument to exclude variable(s).

  • Here this was done for the date variable.

From long to wide

wide_data <- economics_long |> 
  pivot_wider(
    id_cols = date,
    names_from = variable,
    values_from = value)
wide_data
# A tibble: 574 × 6
   date         pce    pop psavert uempmed unemploy
   <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
 1 1967-07-01  507. 198712    12.6     4.5     2944
 2 1967-08-01  510. 198911    12.6     4.7     2945
 3 1967-09-01  516. 199113    11.9     4.6     2958
 4 1967-10-01  512. 199311    12.9     4.9     3143
 5 1967-11-01  517. 199498    12.8     4.7     3066
 6 1967-12-01  525. 199657    11.8     4.8     3018
 7 1968-01-01  531. 199808    11.7     5.1     2878
 8 1968-02-01  534. 199920    12.3     4.5     3001
 9 1968-03-01  544. 200056    11.7     4.1     2877
10 1968-04-01  544  200208    12.3     4.6     2709
# ℹ 564 more rows

Pivot wider

Arguments

  • The names_from will be a variable including column names of the new data frame.

  • The values_from will be a variable that will become the body of the new data frame.

  • The id_cols will be the variables used to uniquely identify the observations in the new data frame.
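The two pivots are inverses of each other. A round trip on a toy tibble (the column names are made up for illustration) returns to where we started:

```r
library(tidyr)

toy <- tibble::tibble(date = c("d1", "d2"),
                      pce  = c(507, 510),
                      pop  = c(198712, 198911))

long <- toy |>
  pivot_longer(cols = -date, names_to = "variable", values_to = "value")
# 4 rows: one per (date, variable) pair

wide <- long |>
  pivot_wider(id_cols = date, names_from = variable, values_from = value)
# back to 2 rows with columns date, pce, pop
```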

Your turn

Your turn

Consider again the mpg data. There is a column for cty and hwy.

  • Make a long format where there is a column for road_type (cty or hwy) and the values are in the new column called mpg

  • (Extra) You may also like to create a metric version of the variables and a new column for units

Joins

Joins

Multiple data sets

  • Often we work with multiple data sets

  • We need to know how to join data sets together

Example

Consider electricity and weather data (on Moodle):

  • On hot days we use our air conditioner

  • On cold days we use our heating

It makes sense to analyse these together

Energy data

Electricity price, demand and net export by Date, Day and State

energy <- read_csv('data/energydata.csv')
energy
# A tibble: 1,960 × 6
   Date       Day   State Price Demand NetExport
   <date>     <chr> <chr> <dbl>  <dbl>     <dbl>
 1 2018-07-15 Sun   NSW    51.7  7564.   -1231. 
 2 2018-07-16 Mon   NSW    87.9  8966.     -18.6
 3 2018-07-17 Tue   NSW    62.8  8050.    -643. 
 4 2018-07-18 Wed   NSW    54.5  7840.    -742. 
 5 2018-07-19 Thu   NSW    64.2  8168.     -40.6
 6 2018-07-20 Fri   NSW    60.9  8254.     318. 
 7 2018-07-21 Sat   NSW    59.1  7427.    -550. 
 8 2018-07-22 Sun   NSW    56.0  7492.   -2054. 
 9 2018-07-23 Mon   NSW    60.7  8382.    -657. 
10 2018-07-24 Tue   NSW    60.5  7802.    -322. 
# ℹ 1,950 more rows

Weather data

Weather variables: maximum temperature, wind direction and wind speed by Date and State

weather <- read_csv('data/weather.csv')
weather
# A tibble: 1,825 × 5
   Date       MaxTemp WindDir WindSpeed State
   <date>       <dbl> <chr>       <dbl> <chr>
 1 2018-07-01    15.4 S               6 NSW  
 2 2018-07-02    15   SSW             8 NSW  
 3 2018-07-03    16.5 S              15 NSW  
 4 2018-07-04    19.7 NNE             6 NSW  
 5 2018-07-05    25.1 NNW             5 NSW  
 6 2018-07-06    23.9 WNW            11 NSW  
 7 2018-07-07    15.7 WNW             8 NSW  
 8 2018-07-08    16.7 W              12 NSW  
 9 2018-07-09    15   S               8 NSW  
10 2018-07-10    16.1 ESE             3 NSW  
# ℹ 1,815 more rows

Types of join

Types of join

Types

  • Using inner_join(x, y) returns all rows from x with matching values in y.

  • Using left_join(x, y) returns all rows from x with matching values in y, with NA if there is no match.

  • Using right_join(x, y) returns all rows from y with matching values in x, with NA if there is no match.

  • Using full_join(x, y) returns all rows from x or y with NA if there is no match.
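A sketch with two toy tibbles makes the row counts concrete (the Dates d1-d4 are made up; d2 and d3 appear in both tables):

```r
library(dplyr)

x <- tibble::tibble(Date = c("d1", "d2", "d3"), Price   = c(51, 88, 63))
y <- tibble::tibble(Date = c("d2", "d3", "d4"), MaxTemp = c(18, 23, 15))

nrow(inner_join(x, y, by = "Date"))  # 2: only d2 and d3 match
nrow(left_join(x, y, by = "Date"))   # 3: all of x; MaxTemp is NA for d1
nrow(right_join(x, y, by = "Date"))  # 3: all of y; Price is NA for d4
nrow(full_join(x, y, by = "Date"))   # 4: every Date from either table
```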

Types of join

Source: Wickham, H., Çetinkaya-Rundel, M., Grolemund, G. (2023). R for Data Science. United States: O’Reilly Media.

Your turn

Your turn

Try out each of the different joins and see what they do

Compare the resulting dataset size using dim()

For example:

energy_weather_inner <- inner_join(energy, weather, by = c('Date', 'State'))
energy_weather_inner
# A tibble: 1,755 × 9
   Date       Day   State Price Demand NetExport MaxTemp WindDir WindSpeed
   <date>     <chr> <chr> <dbl>  <dbl>     <dbl>   <dbl> <chr>       <dbl>
 1 2018-07-15 Sun   NSW    51.7  7564.   -1231.     18   NW              2
 2 2018-07-16 Mon   NSW    87.9  8966.     -18.6    18.5 W               7
 3 2018-07-17 Tue   NSW    62.8  8050.    -643.     22.5 WNW             9
 4 2018-07-18 Wed   NSW    54.5  7840.    -742.     20.8 SSW             1
 5 2018-07-19 Thu   NSW    64.2  8168.     -40.6    20.8 NNW             1
 6 2018-07-20 Fri   NSW    60.9  8254.     318.     15.5 W              13
 7 2018-07-21 Sat   NSW    59.1  7427.    -550.     16.3 SE              4
 8 2018-07-22 Sun   NSW    56.0  7492.   -2054.     15.9 NE              6
 9 2018-07-23 Mon   NSW    60.7  8382.    -657.     19.9 NW              4
10 2018-07-24 Tue   NSW    60.5  7802.    -322.     24.4 ENE             3
# ℹ 1,745 more rows

Summary

Summary

Summary

  • Covered changing data types in R when reading data

  • Reviewed good data practices: 1NF and tidy data

  • Learnt the basics of transforming data

    • dplyr package: select(), filter(), mutate()
    • Logical statements
  • Created new data sets

    • using group_by() and summarise()
  • Learnt how to reshape our data between long and wide formats

    • pivot_wider() and pivot_longer()
  • Learnt how to join data sets

    • inner_join(), left_join(), right_join(), full_join()

Reading data from the web

Getting data

Tip

  • Although data will sometimes be provided to you, it is useful to know some ways to get data off the web.

  • This can be integrated into your R workflow.

  • Techniques range from commands to download files to more sophisticated web scraping.

  • Websites can change so always make sure to save data as well.

Downloading files in R

  • As an example consider wholesale electricity prices which can be downloaded online.

Hover over the download link, right click, and copy the link location.

Reading Data from a URL

We can also use read_csv to read data directly from a URL.

Example

url <- "https://www.aemo.com.au/aemo/data/nem/priceanddemand/PRICE_AND_DEMAND_202411_NSW1.csv"

aemo_data <- read_csv(url)

head(aemo_data)
# A tibble: 6 × 5
  REGION SETTLEMENTDATE      TOTALDEMAND   RRP PERIODTYPE
  <chr>  <chr>                     <dbl> <dbl> <chr>     
1 NSW1   2024/11/01 00:05:00       6816. 100.  TRADE     
2 NSW1   2024/11/01 00:10:00       6841. 100.  TRADE     
3 NSW1   2024/11/01 00:15:00       6802. 113.  TRADE     
4 NSW1   2024/11/01 00:20:00       6787. 101.  TRADE     
5 NSW1   2024/11/01 00:25:00       6666. 100.  TRADE     
6 NSW1   2024/11/01 00:30:00       6643.  82.1 TRADE     
write_csv(aemo_data, "data/PRICE_AND_DEMAND_202411_NSW1.csv")

Benefits

Why use URL?

  • Location where data is retrieved is kept.

  • Usually URLs are created systematically.

    • Data for Victoria will have VIC as part of URL.
    • Data for May 2024 will have 202405 as part of URL.
  • A large number of files can be downloaded and combined using a loop.
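As a sketch, the URL template follows the AEMO example above; the month and state values here are assumptions for illustration, and the actual download is left commented out since it hits the network:

```r
# Template matching the AEMO file naming pattern shown earlier
base <- "https://www.aemo.com.au/aemo/data/nem/priceanddemand/PRICE_AND_DEMAND_%s_%s1.csv"

months <- c("202410", "202411")  # hypothetical months of interest
states <- c("NSW", "VIC")        # hypothetical states of interest

# Build every month-state combination of the URL
combos <- expand.grid(month = months, state = states,
                      stringsAsFactors = FALSE)
urls <- sprintf(base, combos$month, combos$state)
urls

# Each file could then be downloaded and combined, for example:
# all_data <- purrr::map_dfr(urls, read_csv)
```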