library(tidyverse)
boston_celtics <- read_csv("data/boston_celtics.csv",
col_types = cols(game_date = col_date()))
str(boston_celtics$game_date) Date[1:401], format: "2021-06-01" "2021-05-30" "2021-05-28" "2021-05-25" "2021-05-22" ...
Lecturer: Kate Saunders
Department of Econometrics and Business Statistics
Data often isn’t given to us in the form we want!!!
Getting Data
Often the variables we want to plot aren’t in the data set
We may need to create variables we want ourselves
Or get rid of variables we don’t need
Or create new summaries of our data for plotting
We might also need to restructure our data entirely to plot it
Learning Objectives
Learn more about reading data in R so our data types are correct
Learn more about how we’d like to give and receive data
Learn the basics of transforming data
Learn how to create new data sets from existing data by:
Overview
Data often stored in a single table
Columns are separated by a comma or a tab
The readr package is very useful for reading files into R as data frame objects
The readr package is part of the tidyverse packages
Data frame
data.frame in base Rtibble in the tidyverseRead functions
Two of the more useful functions are read_csv and read_tsv for comma delimited and tab delimited files, respectively
Generally the defaults for this function work quite well.
One argument that is worth discussing a little is col_types
| Code | Description |
|---|---|
col_logical() |
Logical (TRUE, FALSE, or NA) |
col_integer() |
Integer |
col_double() |
Double (numeric with decimals) |
col_character() |
Character |
col_factor() |
Categorical variable as a factor |
col_date() |
Date (YYYY-MM-DD) |
col_datetime() |
Date-time (POSIXct) |
col_time() |
Time (HH:MM:SS) |
col_skip() |
Skip this column entirely |
col_guess() |
Let readr guess the column type |
Note
A common problem is dates reading in as a character string
This might have happened to you: here is how to fix it
Long numbers
The largest integer that many computers will accurately store is about 19 digits long.
Some identification numbers are longer than that (for instance IBAN numbers used for bank transfers).
In some cases, R may try to read these numbers in as integers.
To avoid errors these should be read in as characters.
Be careful
Specifying col_types = list(.default = "c") overrides the default behaviour of readr.
You can use this to force everything to be read in as a character.
If needed these can subsequently be converted to numeric variables or other variable types.
Suppose we have the following database:
# A tibble: 3 × 3
Name DoB Email
<chr> <chr> <chr>
1 Ahmed 1994/03/01 ahmed@personal.com
2 Bin 1954/12/23 bin@me.com; bin@work.com
3 Carol 1982/07/16 carol@mailcom
Problems
# A tibble: 3 × 4
Name DoB Email1 Email2
<chr> <chr> <chr> <chr>
1 Ahmed 1994/03/01 ahmed@personal.com <NA>
2 Bin 1954/12/23 bin@me.com bin@work.com
3 Carol 1982/07/16 carol@mailcom <NA>
Still problems
This solution forces an ordering between Email1 and Email2 that may be arbitrary.
Also, a new entry into the database may be:
The entire database needs to be changed to allow for three email addresses.
# A tibble: 4 × 3
Name DoB Email
<chr> <chr> <chr>
1 Ahmed 1994/03/01 ahmed@personal.com
2 Bin 1954/12/23 bin@me.com
3 Bin 1954/12/23 bin@work.com
4 Carol 1982/07/16 carol@mailcom
Deepal can be added as
# A tibble: 3 × 3
Name DoB Email
<chr> <chr> <chr>
1 Deepal 1987/04/23 deepal@work.com
2 Deepal 1987/04/23 deepal@me.com
3 Deepal 1987/04/23 coolgirl87@me.com
We want data in an easily workable and robust form!
Databases
1NF is an important concept in database normalisation
No repeating groups
Every field contains atomic (indivisible) values
Each record is unique
1NF is closely related to Tidy data principles.
Tidy Data Principles1
dlpyr R package for data wrangling
Many of the functions in dplyr are similar to functions in the SQL (database language)
SQL is itself built upon (but not exactly the same) as a theoretical model known as Codd’s model
An important aspect of Codd’s model is the first normal form
Codd’s model provides guidelines for good data management
Why you care: TLDR
We map variables to our aesthetic layer for plotting
If our data is a tidy form then it’s easy for us to do that mapping
If not - We’ll use dlpyr tools to format our data
dplyr R packageAgain
A few simple functions
Much can be done with a few simple functions from the dplyr package:
selectfiltermutateIn all cases both input and output is a data frame.
Check out the dplyr cheat sheet
The diamonds data is a tibble include in R
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
Note
The select function can be used if we only want to focus on a subset of variables
Check out ?select for all the different way to select columns
e.g. For the diamonds data set we may only be interested in carat, cut and price.
# A tibble: 53,940 × 3
carat cut price
<dbl> <ord> <int>
1 0.23 Ideal 326
2 0.21 Premium 326
3 0.23 Good 327
4 0.29 Premium 334
5 0.31 Good 335
6 0.24 Very Good 336
7 0.24 Very Good 336
8 0.26 Very Good 337
9 0.22 Fair 337
10 0.23 Very Good 338
# ℹ 53,930 more rows
To drop variables, use a - sign.
# A tibble: 53,940 × 7
color clarity depth table x y z
<ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
1 E SI2 61.5 55 3.95 3.98 2.43
2 E SI1 59.8 61 3.89 3.84 2.31
3 E VS1 56.9 65 4.05 4.07 2.31
4 I VS2 62.4 58 4.2 4.23 2.63
5 J SI2 63.3 58 4.34 4.35 2.75
6 J VVS2 62.8 57 3.94 3.96 2.48
7 I VVS1 62.3 57 3.95 3.98 2.47
8 H SI1 61.9 55 4.07 4.11 2.53
9 E VS2 65.1 61 3.87 3.78 2.49
10 H VS1 59.4 61 4 4.05 2.39
# ℹ 53,930 more rows
Suppose we only want to consider diamonds worth more than $1000
# A tibble: 39,416 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.7 Ideal E SI1 62.5 57 2757 5.7 5.72 3.57
2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
3 0.7 Ideal G VS2 61.6 56 2757 5.7 5.67 3.5
4 0.71 Very Good E VS2 62.4 57 2759 5.68 5.73 3.56
5 0.78 Very Good G SI2 63.8 56 2759 5.81 5.85 3.72
6 0.7 Good E VS2 57.5 58 2759 5.85 5.9 3.38
7 0.7 Good F VS1 59.4 62 2759 5.71 5.76 3.4
8 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
9 0.73 Very Good E SI1 61.6 59 2760 5.77 5.78 3.56
10 0.8 Premium H SI1 61.5 58 2760 5.97 5.93 3.66
# ℹ 39,406 more rows
Examples
The term price > 1000 is an example of a logical statement. It can be true or false.
Other examples are
price < 1000price <= 1000price != 1000price == 1000Note two equal signs.
In general ! is the negation of a statement.
&.# A tibble: 14,700 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.7 Ideal E SI1 62.5 57 2757 5.7 5.72 3.57
2 0.7 Ideal G VS2 61.6 56 2757 5.7 5.67 3.5
3 0.74 Ideal G SI1 61.6 55 2760 5.8 5.85 3.59
4 0.8 Ideal I VS1 62.9 56 2760 5.94 5.87 3.72
5 0.75 Ideal G SI1 62.2 55 2760 5.87 5.8 3.63
6 0.74 Ideal I VVS2 62.3 55 2761 5.77 5.81 3.61
7 0.81 Ideal F SI2 58.8 57 2761 6.14 6.11 3.6
8 0.59 Ideal E VVS2 62 55 2761 5.38 5.43 3.35
9 0.8 Ideal F SI2 61.4 57 2761 5.96 6 3.67
10 0.74 Ideal E SI2 62.2 56 2761 5.8 5.84 3.62
# ℹ 14,690 more rows
|.# A tibble: 27,445 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
5 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
6 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
7 0.2 Premium E SI2 60.2 62 345 3.79 3.75 2.27
8 0.32 Premium E I1 60.9 58 345 4.38 4.42 2.68
9 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68
10 0.23 Very Good E VS2 63.8 55 352 3.85 3.92 2.48
# ℹ 27,435 more rows
%in%.# A tibble: 23,161 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
3 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
4 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
5 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68
6 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
7 0.33 Ideal I SI2 61.2 56 403 4.49 4.5 2.75
8 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
9 0.23 Ideal G VS1 61.9 54 404 3.93 3.95 2.44
10 0.32 Ideal I SI1 60.9 55 404 4.45 4.48 2.72
# ℹ 23,151 more rows
! as not.# A tibble: 30,779 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
5 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
6 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
7 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
8 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
9 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
10 0.22 Premium F SI1 60.4 61 342 3.88 3.84 2.33
# ℹ 30,769 more rows
Your turn
price greater than 2000 and carat less than 3.price greater than 2000 and cut either Very Good, Premium or Ideal.carat less than 2 or price greater than 500.carat less than 2 and cut not equal to Premium.The mpg (miles per gallon) data is a tibble include in R
# A tibble: 234 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
# ℹ 224 more rows
Example
cty and hwy# A tibble: 234 × 4
cty hwy cty_metric hwy_metric
<int> <int> <dbl> <dbl>
1 18 29 7.65 12.3
2 21 29 8.93 12.3
3 20 31 8.50 13.2
4 21 30 8.93 12.8
5 16 26 6.80 11.1
6 18 26 7.65 11.1
7 18 27 7.65 11.5
8 18 26 7.65 11.1
9 16 25 6.80 10.6
10 20 28 8.50 11.9
# ℹ 224 more rows
Super useful
In dplyr as the input and output are always data frames.
That means chain commands together using |> (or older version %>%)
This code
does the same as this code
and it’s a lot more readable!
Your turn
Start with mpg and practice using the |> operator:
Create the new variables for cty_metric and hwy_metric yourself
Add another new variable called efficient - cars are efficient if the hwy_metric < 6
Then create a data set of manufacturer and model that contains only the efficient cars
How many efficient cars are there? (Hint: look up ?distinct() and ?nrow())
Data
The txhousing data is a tibble included in R
Contains monthly sales data for each city from year 2000-2016
Suppose the aim is to find total number of sales across all cities in a year
# A tibble: 8,602 × 9
city year month sales volume median listings inventory date
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Abilene 2000 1 72 5380000 71400 701 6.3 2000
2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.
3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.
4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.
5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.
6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.
7 Abilene 2000 7 152 12635000 73500 742 6.2 2000.
8 Abilene 2000 8 131 10710000 75000 765 6.4 2001.
9 Abilene 2000 9 104 7615000 64500 771 6.5 2001.
10 Abilene 2000 10 101 7040000 59300 764 6.6 2001.
# ℹ 8,592 more rows
group_by
group_by allows us perform functions on groups within our data
e.g across a year, or a categorical variable
When used in combination with summarise function, it can construct new data sets
e.g get a total, mean, sd for that group
total_sales <- txhousing |>
group_by(year) |>
summarise(total_sales = sum(sales, na.rm = TRUE))
head(total_sales, n = 5)# A tibble: 5 × 2
year total_sales
<int> <dbl>
1 2000 222483
2 2001 231453
3 2002 234600
4 2003 253909
5 2004 283999
Note
Data for some cities is unavailable and is filled in as NA (early years in particular)
na.rm = TRUE removes missing data before taking the sum
Two variables
Suppose we want to get the yearly total for each city
Need to group by city and year.
total_sales <- txhousing |>
group_by(year, city) |>
summarise(total_sales = sum(sales, na.rm = TRUE))
total_sales# A tibble: 736 × 3
# Groups: year [16]
year city total_sales
<int> <chr> <dbl>
1 2000 Abilene 1375
2 2000 Amarillo 2076
3 2000 Arlington 4969
4 2000 Austin 18621
5 2000 Bay Area 4884
6 2000 Beaumont 1781
7 2000 Brazoria County 1060
8 2000 Brownsville 486
9 2000 Bryan-College Station 1356
10 2000 Collin County 10000
# ℹ 726 more rows
Your turn
In the diamonds data set, find the average price for each cut of diamond
In the diamonds data set, find the average price for each cut of diamond given that price is above 2000.
In the mpg data set find the average fuel efficiency in the city for each year.
In the mpg data set, consider Toyota, Nissan and Honda. Find the value of hwy for each of these manufacturer’s most fuel efficient car on the highway.
Note
Generally attempt to have one column per variable.
Sometimes the meaning of a variable is ambiguous
In many cases it is easier to use ggplot2 if the data are reshaped into a different form.
We investigate using the economics data.
# A tibble: 574 × 6
date pce pop psavert uempmed unemploy
<date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1967-07-01 507. 198712 12.6 4.5 2944
2 1967-08-01 510. 198911 12.6 4.7 2945
3 1967-09-01 516. 199113 11.9 4.6 2958
4 1967-10-01 512. 199311 12.9 4.9 3143
5 1967-11-01 517. 199498 12.8 4.7 3066
6 1967-12-01 525. 199657 11.8 4.8 3018
7 1968-01-01 531. 199808 11.7 5.1 2878
8 1968-02-01 534. 199920 12.3 4.5 3001
9 1968-03-01 544. 200056 11.7 4.1 2877
10 1968-04-01 544 200208 12.3 4.6 2709
# ℹ 564 more rows
# A tibble: 2,870 × 4
date variable value value01
<date> <chr> <dbl> <dbl>
1 1967-07-01 pce 507. 0
2 1967-08-01 pce 510. 0.000265
3 1967-09-01 pce 516. 0.000762
4 1967-10-01 pce 512. 0.000471
5 1967-11-01 pce 517. 0.000916
6 1967-12-01 pce 525. 0.00157
7 1968-01-01 pce 531. 0.00207
8 1968-02-01 pce 534. 0.00230
9 1968-03-01 pce 544. 0.00322
10 1968-04-01 pce 544 0.00319
# ℹ 2,860 more rows
It has one column indicating the type of variable contained in that row and then a separate column for the value for that variable.
Both wide and long formats store the same information.
We will learn about facet_wrap next workshop.
long_data <- economics |>
pivot_longer(
cols = -date,
names_to = 'Variable',
values_to = 'Value')
long_data# A tibble: 2,870 × 3
date Variable Value
<date> <chr> <dbl>
1 1967-07-01 pce 507.
2 1967-07-01 pop 198712
3 1967-07-01 psavert 12.6
4 1967-07-01 uempmed 4.5
5 1967-07-01 unemploy 2944
6 1967-08-01 pce 510.
7 1967-08-01 pop 198911
8 1967-08-01 psavert 12.6
9 1967-08-01 uempmed 4.7
10 1967-08-01 unemploy 2945
# ℹ 2,860 more rows
Arguments
The names_to will be a variable of column names.
The values_to will be a variable that stores the body of the data frame.
Use - in the cols argument to exclude variable(s).
Here this was done for the date variable.
wide_data <- economics_long |>
pivot_wider(
id_cols = date,
names_from = variable,
values_from = value)
wide_data# A tibble: 574 × 6
date pce pop psavert uempmed unemploy
<date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1967-07-01 507. 198712 12.6 4.5 2944
2 1967-08-01 510. 198911 12.6 4.7 2945
3 1967-09-01 516. 199113 11.9 4.6 2958
4 1967-10-01 512. 199311 12.9 4.9 3143
5 1967-11-01 517. 199498 12.8 4.7 3066
6 1967-12-01 525. 199657 11.8 4.8 3018
7 1968-01-01 531. 199808 11.7 5.1 2878
8 1968-02-01 534. 199920 12.3 4.5 3001
9 1968-03-01 544. 200056 11.7 4.1 2877
10 1968-04-01 544 200208 12.3 4.6 2709
# ℹ 564 more rows
Arguments
The names_from will be a variable including column names of the new data frame.
The values_from will be a variable that will become the body of the new data frame.
The id_cols will be the variables used to uniquely identify the observations in the new data frame.
Your turn
Consider again the mpg data. There is a column for cty and hwy.
Make a long format where there is a column for road_type (cty or hwy) and the values are in the new column called mpg
(Extra) You may also like to create a metric version of the variables and a new column for units
Multiple data sets
Often we work with multiple data sets
We need to know how to join data sets together
Example
Consider electricity and weather data (on Moodle):
Hot days we use our airconditioner
Cold days we use our heating
Makes sense to analyse these together
Electricity price: demand and export by date, day and State
# A tibble: 1,960 × 6
Date Day State Price Demand NetExport
<date> <chr> <chr> <dbl> <dbl> <dbl>
1 2018-07-15 Sun NSW 51.7 7564. -1231.
2 2018-07-16 Mon NSW 87.9 8966. -18.6
3 2018-07-17 Tue NSW 62.8 8050. -643.
4 2018-07-18 Wed NSW 54.5 7840. -742.
5 2018-07-19 Thu NSW 64.2 8168. -40.6
6 2018-07-20 Fri NSW 60.9 8254. 318.
7 2018-07-21 Sat NSW 59.1 7427. -550.
8 2018-07-22 Sun NSW 56.0 7492. -2054.
9 2018-07-23 Mon NSW 60.7 8382. -657.
10 2018-07-24 Tue NSW 60.5 7802. -322.
# ℹ 1,950 more rows
Weather variables: Maximum temperature, wind direction and wind speed by Date
# A tibble: 1,825 × 5
Date MaxTemp WindDir WindSpeed State
<date> <dbl> <chr> <dbl> <chr>
1 2018-07-01 15.4 S 6 NSW
2 2018-07-02 15 SSW 8 NSW
3 2018-07-03 16.5 S 15 NSW
4 2018-07-04 19.7 NNE 6 NSW
5 2018-07-05 25.1 NNW 5 NSW
6 2018-07-06 23.9 WNW 11 NSW
7 2018-07-07 15.7 WNW 8 NSW
8 2018-07-08 16.7 W 12 NSW
9 2018-07-09 15 S 8 NSW
10 2018-07-10 16.1 ESE 3 NSW
# ℹ 1,815 more rows
Types
Using inner_join(x, y) returns all rows from x with matching values in y.
Using left_join(x, y) returns all rows from x with matching values in y, with NA if there is no match.
Using right_join(x, y) returns all rows from y with matching values in x, with NA if there is no match.
Using full_join(x, y) returns all rows from x or y with NA if there is no match.
Source: Wickham, H., Çetinkaya-Rundel, M., Grolemund, G. (2023). R for Data Science. United States: O’Reilly Media.
Your turn
Try out each of the different joins and see what they do
Compare the resulting dataset size using dim()
For example:
# A tibble: 1,755 × 9
Date Day State Price Demand NetExport MaxTemp WindDir WindSpeed
<date> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 2018-07-15 Sun NSW 51.7 7564. -1231. 18 NW 2
2 2018-07-16 Mon NSW 87.9 8966. -18.6 18.5 W 7
3 2018-07-17 Tue NSW 62.8 8050. -643. 22.5 WNW 9
4 2018-07-18 Wed NSW 54.5 7840. -742. 20.8 SSW 1
5 2018-07-19 Thu NSW 64.2 8168. -40.6 20.8 NNW 1
6 2018-07-20 Fri NSW 60.9 8254. 318. 15.5 W 13
7 2018-07-21 Sat NSW 59.1 7427. -550. 16.3 SE 4
8 2018-07-22 Sun NSW 56.0 7492. -2054. 15.9 NE 6
9 2018-07-23 Mon NSW 60.7 8382. -657. 19.9 NW 4
10 2018-07-24 Tue NSW 60.5 7802. -322. 24.4 ENE 3
# ℹ 1,745 more rows
Summary
Covered changing data types in R when reading data
Reviewed good data practices: 1NF and tidy data
Learnt the basics of transforming data
dplyr package: select(), filter(), mutate()Created new data sets
group_by() and summarise()Learnt how to reshape our data changing from long vs. wide formats
pivot_wider() or pivot_longerLearnt how to joining data sets
inner_join(), left_join(), right_join() , full_join()Tip
Although data will sometimes be provided to you, it is useful to know some ways to get data off the web.
This can be integrated into your R workflow.
Techniques range from commands to download files to more sophisticated web scraping.
Websites can change so always make sure to save data as well.
Hover over download, right click and obtain link location.
We can use read_csv to also read data from a URL.
url = "https://www.aemo.com.au/aemo/data/nem/priceanddemand/PRICE_AND_DEMAND_202411_NSW1.csv"
aemo_data <- read_csv(url)
head(aemo_data)# A tibble: 6 × 5
REGION SETTLEMENTDATE TOTALDEMAND RRP PERIODTYPE
<chr> <chr> <dbl> <dbl> <chr>
1 NSW1 2024/11/01 00:05:00 6816. 100. TRADE
2 NSW1 2024/11/01 00:10:00 6841. 100. TRADE
3 NSW1 2024/11/01 00:15:00 6802. 113. TRADE
4 NSW1 2024/11/01 00:20:00 6787. 101. TRADE
5 NSW1 2024/11/01 00:25:00 6666. 100. TRADE
6 NSW1 2024/11/01 00:30:00 6643. 82.1 TRADE
Why use URL?
Location where data is retrieved is kept.
Usually URL are created systematically.
A large number of files can be downloaded and combined using a loop.
ETX2250/ETF5922