The aim of this tutorial is to introduce you to the tidyverse.
There are quite a few references offering more background, and I recommend consulting them for further reading.
The tidyverse is effectively a reimagining of data analytics that requires your data to be in a tidy format.
Once your data is tidy, analyzing and working with it is simpler, more stable and more efficient - and you will never need to type out clumsy for-loops again (at least in the traditional sense of the word).
Tidy data means each variable is in its own column, with each observation being in its own row.
We’ll get to what this means in examples later on.
Let’s first discuss the tidyverse:
The tidyverse package is a collection of various other packages, including:
dplyr : awesome for data wrangling
readr : awesome for importing data
ggplot2 : the standard for plotting in a tidy framework
tidyr : essential for manipulating dataframes
purrr : makes your code (and you) purrr with delight… you’ll see. Absolute syntactic sugar. Pure elation.
A key tool in using the tidyverse is the pipe operator ‘%>%’.
x %>% f is equivalent to f(x)
x %>% f(y) is equivalent to f(x,y)
We can use a dot to refer to the data input: x %>% f(.) is equivalent to f(x)
Type CTRL+SHIFT+M in RStudio, and you will see the pipe appear. This will be your friend from now on.
‘%>%’ should be thought of and ‘read’ as ‘then’.
Thus, X %>% f(…) %>% g(…) should be read as X then apply f(…) then apply g(…)
This makes your code much more readable (going from left to right) and easier to manage.
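As a minimal illustration of this left-to-right reading (using the built-in mtcars data, purely for demonstration):

pacman::p_load(tidyverse)
# Nested call - must be read inside-out:
summary(head(mtcars, 3))
# Piped call - read left to right: take mtcars, then head(3), then summary:
mtcars %>% head(3) %>% summary()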
Let’s see it in action quickly using the ‘select’ verb:
## # A tibble: 790 × 3
## Date brz chl
## <date> <dbl> <dbl>
## 1 2000-01-14 1420. NA
## 2 2000-01-21 1382. NA
## 3 2000-01-28 1344. NA
## 4 2000-02-04 1426. 1522.
## 5 2000-02-11 1416. 1544.
## 6 2000-02-18 1372. 1443.
## 7 2000-02-25 1385. 1380.
## 8 2000-03-03 1439. 1456.
## 9 2000-03-10 1424. 1421.
## 10 2000-03-17 1383. 1419.
## # … with 780 more rows
The cornerstone of the tidyverse is to make your data, well, tidy.
This means it is time to reboot how you think about dataframes.
dplyr requires you to think in a tidy way about your data - in particular, your data should be in a long format, as opposed to wide.
The thinking behind dplyr is to organize your data in such a way that analysis becomes simpler (much simpler).
The idea behind tidy data is that each variable forms its own column and each observation its own row.
For our earlier dataframe, making it tidy implies gathering the wide country columns into a single Country column, with the corresponding values stacked in a single value column.
This long structure allows us to easily apply the same functions to each Country sequentially.
Examples later will illustrate the ease with which this can be done.
Naturally this requires us to decide what to call the gathered key column, what to call the values so gathered, and which columns to gather together.
pacman::p_load(tidyverse)
data_use <- fmxdat::Indexes
Tidydat <- data_use %>% tidyr::gather(Country, val, -Date) %>% arrange(Date)
print(Tidydat)
## # A tibble: 16,590 × 3
## Date Country val
## <date> <chr> <dbl>
## 1 2000-01-14 brz 1420.
## 2 2000-01-14 chl NA
## 3 2000-01-14 chn 36.7
## 4 2000-01-14 col 105.
## 5 2000-01-14 cze 91.7
## 6 2000-01-14 egt 352.
## 7 2000-01-14 grc 1215.
## 8 2000-01-14 hun 369.
## 9 2000-01-14 ind 175.
## 10 2000-01-14 ino 280.
## # … with 16,580 more rows
Now why would this be a preferable format?
Consider now the ease with which we could, e.g., calculate each country’s returns:
Tidydat %>%
# When calculating returns - ensure your data is ordered by date
arrange(Date) %>%
group_by(Country) %>% mutate(Returns = val / lag(val) - 1) %>%
ungroup()
## # A tibble: 16,590 × 4
## Date Country val Returns
## <date> <chr> <dbl> <dbl>
## 1 2000-01-14 brz 1420. NA
## 2 2000-01-14 chl NA NA
## 3 2000-01-14 chn 36.7 NA
## 4 2000-01-14 col 105. NA
## 5 2000-01-14 cze 91.7 NA
## 6 2000-01-14 egt 352. NA
## 7 2000-01-14 grc 1215. NA
## 8 2000-01-14 hun 369. NA
## 9 2000-01-14 ind 175. NA
## 10 2000-01-14 ino 280. NA
## # … with 16,580 more rows
Notice my use of the verb ‘mutate’ above.
mutate adds a column by applying the function given. The idea can be summarised as:
df %>% mutate(New_Column = f(Existing_Column))
Tidydat %>%
# When calculating returns - ensure your data is ordered by date
arrange(Date) %>%
group_by(Country) %>%
mutate(Returns = val / lag(val) - 1) %>%
mutate(Std_Returns = ( Returns - mean(Returns, na.rm=T) ) / sd(Returns, na.rm=T) ) %>%
ungroup()
## # A tibble: 16,590 × 5
## Date Country val Returns Std_Returns
## <date> <chr> <dbl> <dbl> <dbl>
## 1 2000-01-14 brz 1420. NA NA
## 2 2000-01-14 chl NA NA NA
## 3 2000-01-14 chn 36.7 NA NA
## 4 2000-01-14 col 105. NA NA
## 5 2000-01-14 cze 91.7 NA NA
## 6 2000-01-14 egt 352. NA NA
## 7 2000-01-14 grc 1215. NA NA
## 8 2000-01-14 hun 369. NA NA
## 9 2000-01-14 ind 175. NA NA
## 10 2000-01-14 ino 280. NA NA
## # … with 16,580 more rows
filter is another useful verb. It allows easy subsetting of your data. E.g., let’s filter out the first date of our dataframe above (which has NA returns):
Tidydat %>%
# When calculating returns - ensure your data is ordered by date
arrange(Date) %>%
group_by(Country) %>%
mutate(Returns = val / lag(val) - 1) %>%
filter(Date > first(Date)) %>%
ungroup()
## # A tibble: 16,569 × 4
## Date Country val Returns
## <date> <chr> <dbl> <dbl>
## 1 2000-01-21 brz 1382. -0.0267
## 2 2000-01-21 chl NA NA
## 3 2000-01-21 chn 36.1 -0.0172
## 4 2000-01-21 col 102. -0.0281
## 5 2000-01-21 cze 98.3 0.0722
## 6 2000-01-21 egt 375. 0.0656
## 7 2000-01-21 grc 1177. -0.0320
## 8 2000-01-21 hun 379. 0.0271
## 9 2000-01-21 ind 179. 0.0232
## 10 2000-01-21 ino 261. -0.0670
## # … with 16,559 more rows
Filter can also be used nicely with vector arguments…
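A sketch of such a call, matching the output below (keeping only Brazil and South Africa):

# Keep only the countries appearing in the supplied vector:
Tidydat %>% filter(Country %in% c("brz", "zar"))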
## # A tibble: 1,580 × 3
## Date Country val
## <date> <chr> <dbl>
## 1 2000-01-14 brz 1420.
## 2 2000-01-14 zar 257.
## 3 2000-01-21 brz 1382.
## 4 2000-01-21 zar 254.
## 5 2000-01-28 brz 1344.
## 6 2000-01-28 zar 229.
## 7 2000-02-04 brz 1426.
## 8 2000-02-04 zar 238.
## 9 2000-02-11 brz 1416.
## 10 2000-02-11 zar 239.
## # … with 1,570 more rows
… grepl calls (e.g. countries with a z or a b in their names)…
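A sketch using grepl inside filter (the regular expression "z|b" matches either letter):

# Keep countries whose code contains a 'z' or a 'b':
Tidydat %>% filter(grepl("z|b", Country))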
## # A tibble: 2,370 × 3
## Date Country val
## <date> <chr> <dbl>
## 1 2000-01-14 brz 1420.
## 2 2000-01-14 cze 91.7
## 3 2000-01-14 zar 257.
## 4 2000-01-21 brz 1382.
## 5 2000-01-21 cze 98.3
## 6 2000-01-21 zar 254.
## 7 2000-01-28 brz 1344.
## 8 2000-01-28 cze 97.1
## 9 2000-01-28 zar 229.
## 10 2000-02-04 brz 1426.
## # … with 2,360 more rows
… dates (notice the lubridate package)
pacman::p_load(lubridate)
# .... all dates on or after 5 January 2001:
Tidydat %>% filter( Date >= lubridate::ymd(20010105))
## # A tibble: 15,519 × 3
## Date Country val
## <date> <chr> <dbl>
## 1 2001-01-05 brz 1286.
## 2 2001-01-05 chl 1202.
## 3 2001-01-05 chn 26.8
## 4 2001-01-05 col 55.8
## 5 2001-01-05 cze 91.0
## 6 2001-01-05 egt 187.
## 7 2001-01-05 grc 700.
## 8 2001-01-05 hun 257.
## 9 2001-01-05 ind 136.
## 10 2001-01-05 ino 111.
## # … with 15,509 more rows
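A similar date-window filter produces the next output; the exact bounds are an assumption inferred from the printed rows:

# .... all dates in the first two months of 2015:
Tidydat %>% filter(Date >= lubridate::ymd(20150101), Date <= lubridate::ymd(20150228))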
## # A tibble: 189 × 3
## Date Country val
## <date> <chr> <dbl>
## 1 2015-01-02 brz 4980.
## 2 2015-01-02 chl 4381.
## 3 2015-01-02 chn 115.
## 4 2015-01-02 col 1745.
## 5 2015-01-02 cze 749.
## 6 2015-01-02 egt 1921.
## 7 2015-01-02 grc 171.
## 8 2015-01-02 hun 478.
## 9 2015-01-02 ind 707.
## 10 2015-01-02 ino 1727.
## # … with 179 more rows
… values
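A sketch of a value filter; the threshold of 1000 is an assumption inferred from the output below:

# Keep only rows where the index level exceeds 1000:
Tidydat %>% filter(val > 1000)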
## # A tibble: 5,512 × 3
## Date Country val
## <date> <chr> <dbl>
## 1 2000-01-14 brz 1420.
## 2 2000-01-14 grc 1215.
## 3 2000-01-14 mex 2377.
## 4 2000-01-14 tur 1048.
## 5 2000-01-21 brz 1382.
## 6 2000-01-21 grc 1177.
## 7 2000-01-21 mex 2216.
## 8 2000-01-28 brz 1344.
## 9 2000-01-28 grc 1125.
## 10 2000-01-28 mex 2106.
## # … with 5,502 more rows
How. Easy. Is. That.
summarise is similar to mutate - but with the added feature of collapsing your data according to the function provided.
Best illustrated with example:
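First, a sketch of the mutate version (na.rm = TRUE handles the NA values in val):

# mutate keeps all rows, repeating each Country's max at every date:
Tidydat %>% group_by(Country) %>% mutate(Max = max(val, na.rm = TRUE))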
## # A tibble: 16,590 × 4
## # Groups: Country [21]
## Date Country val Max
## <date> <chr> <dbl> <dbl>
## 1 2000-01-14 brz 1420. 10340.
## 2 2000-01-14 chl NA 7483.
## 3 2000-01-14 chn 36.7 140.
## 4 2000-01-14 col 105. 2881.
## 5 2000-01-14 cze 91.7 1346.
## 6 2000-01-14 egt 352. 2675.
## 7 2000-01-14 grc 1215. 1975.
## 8 2000-01-14 hun 369. 1595.
## 9 2000-01-14 ind 175. 885.
## 10 2000-01-14 ino 280. 2072.
## # … with 16,580 more rows
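And the summarise version, which collapses the data to one row per Country:

Tidydat %>% group_by(Country) %>% summarise(Max = max(val, na.rm = TRUE))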
## # A tibble: 21 × 2
## Country Max
## <chr> <dbl>
## 1 brz 10340.
## 2 chl 7483.
## 3 chn 140.
## 4 col 2881.
## 5 cze 1346.
## 6 egt 2675.
## 7 grc 1975.
## 8 hun 1595.
## 9 ind 885.
## 10 ino 2072.
## # … with 11 more rows
Notice that in the second chunk, summarise returns only the max values for the groups specified, while mutate adds a column, repeating each group’s max value at every date.
Joining dataframes is key to good analytics.
Tidy joining is typically done using left_join or full_join by some common column.
Let’s work out an example to show this.
Suppose your Tidydat was stored as two Excel files with the following formats:
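The first (wide) file corresponds to fmxdat::Indexes, used in the join below:

fmxdat::Indexes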
## # A tibble: 790 × 22
## Date brz chl chn col cze egt grc hun ind ino kor
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2000-01-14 1420. NA 36.7 105. 91.7 352. 1215. 369. 175. 280. 192.
## 2 2000-01-21 1382. NA 36.1 102. 98.3 375. 1177. 379. 179. 261. 186.
## 3 2000-01-28 1344. NA 36.0 104. 97.1 354. 1125. 379. 187. 246. 191.
## 4 2000-02-04 1426. 1522. 35.5 105. 104. 349. 1202. 370. 198. 239. 189.
## 5 2000-02-11 1416. 1544. 31.8 104. 114. 359. 1257. 396. 220. 249. 188.
## 6 2000-02-18 1372. 1443. 33.2 97.7 114. 350. 1209. 402. 243. 231. 172.
## 7 2000-02-25 1385. 1380. 29.7 90.3 112. 322. 1195. 405. 222. 219. 168.
## 8 2000-03-03 1439. 1456. 28.5 88.2 117. 327. 1164. 401. 217. 208. 177.
## 9 2000-03-10 1424. 1421. 29.8 88.4 117. 327. 1111. 425. 211. 234. 174.
## 10 2000-03-17 1383. 1419. 29.9 87.9 118. 328. 1099. 399. 189. 230. 169.
## # … with 780 more rows, and 10 more variables: mal <dbl>, mex <dbl>, per <dbl>,
## # phi <dbl>, pol <dbl>, rus <dbl>, zar <dbl>, twn <dbl>, tha <dbl>, tur <dbl>
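The second file, containing the labels, corresponds to fmxdat::Indexes_Labs:

fmxdat::Indexes_Labs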
## # A tibble: 9 × 2
## Country Label
## <chr> <chr>
## 1 brz Brics
## 2 chn Brics
## 3 zar Brics
## 4 ind Brics
## 5 rus Brics
## 6 mex Mint
## 7 tur Mint
## 8 hun HEGP
## 9 egt HEGP
Notice that combining this information - to do, e.g., calculations by BRICS grouping - would otherwise have been onerous.
Now it is simple using left_join - joining the Label from the second dataframe to the first:
Joined_Set <-
left_join(
fmxdat::Indexes %>% gather(Country, Return, -Date),
fmxdat::Indexes_Labs,
by = "Country"
) %>% arrange(Date)
Notice that joining allocates a Label by Country, with countries that have no entry in the label table getting NA labels (e.g. chl, col and cze had no label, thus get NA).
In the above example full_join and left_join produce equivalent results, but if the ‘right’ dataframe (Indexes_Labs) had contained other information (like dates) too, full_join would have ensured all of it is preserved.
left_join, by contrast, keeps only the rows of the left dataframe (discarding, e.g., dates that appear only in the right dataframe), while full_join keeps all rows from both.
Let’s now use these verbs on a wide dataframe to do some nice things.
First, our dataset is wide (which should start bothering you):
## # A tibble: 12 × 15
## date SubFactors AAA S…¹ BBB S…² CCC S…³ DDD S…⁴ EEE S…⁵ FFF S…⁶ TTT S…⁷
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2002-01-31 EBIT.EV 15.8 7.78 14.8 13.6 12.5 10.4 -1.23
## 2 2002-01-31 FCF.EV 10.5 6.78 12.3 16.0 13.5 12.0 -1.47
## 3 2002-01-31 ROA 13.4 15.7 12.6 9.71 10.5 8.55 -1.42
## 4 2002-01-31 ROE 3.22 15.9 11.9 10.9 11.9 14.3 0.499
## 5 2002-01-31 TRR6M1M 13.4 14.4 9.76 24.6 17.4 18.4 0.482
## 6 2002-01-31 Volat.D.60 11.4 9.82 12.5 13.0 15.6 21.3 -0.138
## 7 2002-02-28 EBIT.EV 5.83 10.7 10.3 17.2 5.04 7.92 1.07
## 8 2002-02-28 FCF.EV 11.6 9.17 10.5 5.60 13.9 5.52 0.185
## 9 2002-02-28 ROA 4.32 11.5 10.0 13.0 10.2 9.52 -0.930
## 10 2002-02-28 ROE 15.0 13.6 13.0 11.8 20.0 1.98 0.633
## 11 2002-02-28 TRR6M1M 13.7 9.32 10.6 11.3 14.5 15.1 0.517
## 12 2002-02-28 Volat.D.60 6.89 7.87 13.1 10.6 12.2 8.28 -1.21
## # … with 6 more variables: `UUU SPGLOB Equity` <dbl>,
## # `VVV SPGLOB Equity` <dbl>, `XXX SPGLOB Equity` <dbl>,
## # `YYY SPGLOB Equity` <dbl>, `ZZZ SPGLOB Equity` <dbl>, AnotherColumn <chr>,
## # and abbreviated variable names ¹`AAA SJ Equity`, ²`BBB SJ Equity`,
## # ³`CCC SJ Equity`, ⁴`DDD SJ Equity`, ⁵`EEE SJ Equity`, ⁶`FFF SJ Equity`,
## # ⁷`TTT SPGLOB Equity`
Before we gather - let’s plan.
First, our columns suggest two types of stocks - some are from SA (ending with ’ SJ Equity’), while others are US (ending with ’ SPGLOB Equity’).
Second, the values correspond to the SubFactors column.
Let’s gather these by Stocks generally first, after which we create a column where the countries are explicitly labelled.
To do so, we need to exploit the pattern in the column names. Here we can use the tidyselect helpers contains, starts_with, ends_with, one_of, etc.
Let’s use ‘contains’ and ‘Equity’ - which will suffice, as it gathers all the stocks (and ignores ‘date’, ‘AnotherColumn’, and ‘SubFactors’):
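A sketch of the gather call (df_wide is a hypothetical name for the wide dataframe printed above; df_tidy is reused below):

df_tidy <-
  df_wide %>%
  # Gather every column whose name contains 'Equity' into Stocks / Value pairs:
  gather(Stocks, Value, contains("Equity"))
print(df_tidy)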
## # A tibble: 144 × 5
## date SubFactors AnotherColumn Stocks Value
## <date> <chr> <chr> <chr> <dbl>
## 1 2002-01-31 EBIT.EV Random AAA SJ Equity 15.8
## 2 2002-01-31 FCF.EV Random AAA SJ Equity 10.5
## 3 2002-01-31 ROA Random AAA SJ Equity 13.4
## 4 2002-01-31 ROE Random AAA SJ Equity 3.22
## 5 2002-01-31 TRR6M1M Random AAA SJ Equity 13.4
## 6 2002-01-31 Volat.D.60 Random AAA SJ Equity 11.4
## 7 2002-02-28 EBIT.EV Random AAA SJ Equity 5.83
## 8 2002-02-28 FCF.EV Random AAA SJ Equity 11.6
## 9 2002-02-28 ROA Random AAA SJ Equity 4.32
## 10 2002-02-28 ROE Random AAA SJ Equity 15.0
## # … with 134 more rows
Awesome. Let’s now create a column labelling the country based on SJ and SPGLOB using grepl, after which we gsub out the ‘ SJ Equity’ and ‘ SPGLOB Equity’ parts (see how nicely these play along with mutate):
df_tidy_Cleaned <-
df_tidy %>%
mutate(Country = ifelse( grepl(" SJ Equity", Stocks), "SA",
ifelse( grepl(" SPGLOB Equity", Stocks), "SPGLOB", NA)) ) %>%
mutate(Stocks = gsub(" SJ Equity| SPGLOB Equity", "", Stocks))
print(df_tidy_Cleaned)
## # A tibble: 144 × 6
## date SubFactors AnotherColumn Stocks Value Country
## <date> <chr> <chr> <chr> <dbl> <chr>
## 1 2002-01-31 EBIT.EV Random AAA 15.8 SA
## 2 2002-01-31 FCF.EV Random AAA 10.5 SA
## 3 2002-01-31 ROA Random AAA 13.4 SA
## 4 2002-01-31 ROE Random AAA 3.22 SA
## 5 2002-01-31 TRR6M1M Random AAA 13.4 SA
## 6 2002-01-31 Volat.D.60 Random AAA 11.4 SA
## 7 2002-02-28 EBIT.EV Random AAA 5.83 SA
## 8 2002-02-28 FCF.EV Random AAA 11.6 SA
## 9 2002-02-28 ROA Random AAA 4.32 SA
## 10 2002-02-28 ROE Random AAA 15.0 SA
## # … with 134 more rows
This takes a bit of time getting used to - so practice this a bit.
If you want to go back to a wide format, spread is the opposite of gather.
E.g. let’s spread all the ROA and ROE values for SA stocks (notice my use of %in% below):
df_tidy_Cleaned %>%
filter(SubFactors %in% c("ROA", "ROE"), Country == "SA") %>%
spread(SubFactors, Value)
## # A tibble: 12 × 6
## date AnotherColumn Stocks Country ROA ROE
## <date> <chr> <chr> <chr> <dbl> <dbl>
## 1 2002-01-31 Random AAA SA 13.4 3.22
## 2 2002-01-31 Random BBB SA 15.7 15.9
## 3 2002-01-31 Random CCC SA 12.6 11.9
## 4 2002-01-31 Random DDD SA 9.71 10.9
## 5 2002-01-31 Random EEE SA 10.5 11.9
## 6 2002-01-31 Random FFF SA 8.55 14.3
## 7 2002-02-28 Random AAA SA 4.32 15.0
## 8 2002-02-28 Random BBB SA 11.5 13.6
## 9 2002-02-28 Random CCC SA 10.0 13.0
## 10 2002-02-28 Random DDD SA 13.0 11.8
## 11 2002-02-28 Random EEE SA 10.2 20.0
## 12 2002-02-28 Random FFF SA 9.52 1.98
There are many, many combinations of these core verbs that allow efficient data analytics - sometimes with the help of some auxiliary verbs.
E.g., let’s use this dplyr native dataframe: ‘dplyr::storms’ to do some nice wrangling from which you can learn a few tricks… (I will add the tricks in brackets before using it)
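First, print the dataframe:

dplyr::storms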
## # A tibble: 19,066 × 13
## name year month day hour lat long status categ…¹ wind press…²
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropical dep… NA 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropical dep… NA 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropical dep… NA 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropical dep… NA 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropical dep… NA 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropical dep… NA 25 1012
## 7 Amy 1975 6 28 12 33.3 -78 tropical dep… NA 25 1011
## 8 Amy 1975 6 28 18 34 -77 tropical dep… NA 30 1006
## 9 Amy 1975 6 29 0 34.4 -75.8 tropical sto… NA 35 1004
## 10 Amy 1975 6 29 6 34 -74.8 tropical sto… NA 40 1002
## # … with 19,056 more rows, 2 more variables:
## # tropicalstorm_force_diameter <int>, hurricane_force_diameter <int>, and
## # abbreviated variable names ¹category, ²pressure
Let’s see which statuses are available (status, unique):
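A one-liner does it (status is a factor, hence the levels shown in the output):

unique(dplyr::storms$status)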
## [1] tropical depression tropical storm extratropical
## [4] hurricane subtropical storm subtropical depression
## [7] disturbance other low tropical wave
## 9 Levels: disturbance extratropical hurricane ... tropical wave
Let’s replace NA values in hurricane_force_diameter with something elaborate: the ‘lat’ column’s minimum value times the ‘long’ column’s maximum value (coalesce):
dplyr::storms %>% mutate(hurricane_force_diameter = coalesce(hurricane_force_diameter, min(lat, na.rm=T) * max(long, na.rm=T)))
## # A tibble: 19,066 × 13
## name year month day hour lat long status categ…¹ wind press…²
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropical dep… NA 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropical dep… NA 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropical dep… NA 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropical dep… NA 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropical dep… NA 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropical dep… NA 25 1012
## 7 Amy 1975 6 28 12 33.3 -78 tropical dep… NA 25 1011
## 8 Amy 1975 6 28 18 34 -77 tropical dep… NA 30 1006
## 9 Amy 1975 6 29 0 34.4 -75.8 tropical sto… NA 35 1004
## 10 Amy 1975 6 29 6 34 -74.8 tropical sto… NA 40 1002
## # … with 19,056 more rows, 2 more variables:
## # tropicalstorm_force_diameter <int>, hurricane_force_diameter <dbl>, and
## # abbreviated variable names ¹category, ²pressure
Let’s apply the same function on several columns using across:
dplyr::storms %>% mutate( across( .cols = c(lat, long, pressure, tropicalstorm_force_diameter),
.fns = ~max(., na.rm=T)) )
## # A tibble: 19,066 × 13
## name year month day hour lat long status categ…¹ wind press…²
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
## 1 Amy 1975 6 27 0 70.7 13.5 tropical dep… NA 25 1024
## 2 Amy 1975 6 27 6 70.7 13.5 tropical dep… NA 25 1024
## 3 Amy 1975 6 27 12 70.7 13.5 tropical dep… NA 25 1024
## 4 Amy 1975 6 27 18 70.7 13.5 tropical dep… NA 25 1024
## 5 Amy 1975 6 28 0 70.7 13.5 tropical dep… NA 25 1024
## 6 Amy 1975 6 28 6 70.7 13.5 tropical dep… NA 25 1024
## 7 Amy 1975 6 28 12 70.7 13.5 tropical dep… NA 25 1024
## 8 Amy 1975 6 28 18 70.7 13.5 tropical dep… NA 30 1024
## 9 Amy 1975 6 29 0 70.7 13.5 tropical sto… NA 35 1024
## 10 Amy 1975 6 29 6 70.7 13.5 tropical sto… NA 40 1024
## # … with 19,056 more rows, 2 more variables:
## # tropicalstorm_force_diameter <int>, hurricane_force_diameter <int>, and
## # abbreviated variable names ¹category, ²pressure
# Or using summarise and a naming pattern:
dplyr::storms %>% summarise( across( .cols = c(contains("diameter"), starts_with("press")),
.fns = ~max(., na.rm=T)) )
## # A tibble: 1 × 3
## tropicalstorm_force_diameter hurricane_force_diameter pressure
## <int> <int> <int>
## 1 1440 300 1024
Let’s apply several functions to several columns - naming each accordingly using a suffix:
dplyr::storms %>% mutate( across( .cols = c(lat, long, pressure, tropicalstorm_force_diameter),
.fns = list( Max = ~max(., na.rm=T), Min = ~min(., na.rm=T)) ))
## # A tibble: 19,066 × 21
## name year month day hour lat long status categ…¹ wind press…²
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropical dep… NA 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropical dep… NA 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropical dep… NA 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropical dep… NA 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropical dep… NA 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropical dep… NA 25 1012
## 7 Amy 1975 6 28 12 33.3 -78 tropical dep… NA 25 1011
## 8 Amy 1975 6 28 18 34 -77 tropical dep… NA 30 1006
## 9 Amy 1975 6 29 0 34.4 -75.8 tropical sto… NA 35 1004
## 10 Amy 1975 6 29 6 34 -74.8 tropical sto… NA 40 1002
## # … with 19,056 more rows, 10 more variables:
## # tropicalstorm_force_diameter <int>, hurricane_force_diameter <int>,
## # lat_Max <dbl>, lat_Min <dbl>, long_Max <dbl>, long_Min <dbl>,
## # pressure_Max <int>, pressure_Min <int>,
## # tropicalstorm_force_diameter_Max <int>,
## # tropicalstorm_force_diameter_Min <int>, and abbreviated variable names
## # ¹category, ²pressure
Let’s create columns, appended to the original dataframe, that give the maximum wind speed per year and the status at which it was reached. Then calculate each observation’s wind speed as a percentage of the max speed reached that year.
left_join(
dplyr::storms,
dplyr::storms %>% group_by(year) %>% filter(wind == max(wind)) %>%
select(year, name, wind, status ) %>% ungroup() %>%
mutate(MaxSpeed = wind, Max_Wind_Speed_Msg = paste0("max status reached: ", status)) %>%
select(-status, -wind, -name) %>% unique,
by = c("year")
) %>%
mutate(Pct_of_Max = paste0( round( wind / MaxSpeed, 2) * 100, "%" ))
## # A tibble: 19,066 × 16
## name year month day hour lat long status categ…¹ wind press…²
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropical dep… NA 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropical dep… NA 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropical dep… NA 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropical dep… NA 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropical dep… NA 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropical dep… NA 25 1012
## 7 Amy 1975 6 28 12 33.3 -78 tropical dep… NA 25 1011
## 8 Amy 1975 6 28 18 34 -77 tropical dep… NA 30 1006
## 9 Amy 1975 6 29 0 34.4 -75.8 tropical sto… NA 35 1004
## 10 Amy 1975 6 29 6 34 -74.8 tropical sto… NA 40 1002
## # … with 19,056 more rows, 5 more variables:
## # tropicalstorm_force_diameter <int>, hurricane_force_diameter <int>,
## # MaxSpeed <int>, Max_Wind_Speed_Msg <chr>, Pct_of_Max <chr>, and abbreviated
## # variable names ¹category, ²pressure
How powerful are these tools, right? We will see more examples of what we can do later.
The stringr package (part of the tidyverse) is nice for adjusting strings:
str_sub(): extract substrings from a character vector
str_trim(): trim whitespace
str_length(): check the length of a string
str_to_lower()/str_to_upper(): convert a string to lower or upper case
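A quick sketch of these in action (the example strings are made up):

pacman::p_load(stringr)
x <- c("  Tidy Verse  ", "dplyr")
str_trim(x)       # removes leading/trailing spaces: "Tidy Verse" "dplyr"
str_sub(x, 1, 3)  # first three characters of each element
str_length(x)     # number of characters, spaces included
str_to_upper(x)   # "  TIDY VERSE  " "DPLYR"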
Let’s try out some more fun data wrangling examples…. hope you are getting excited!
Let’s calculate a few things, per month, for several crypto currencies.
Tip: use format for dates here… format(date, “%b”) gives, e.g., the month as Feb; %B gives February, while %m gives 02.
Now let’s crypto…
Cryptos <- fmxdat::cryptos
Crypto_Adj <-
Cryptos %>%
mutate(YearMonth = format(date, "%b%y")) %>%
group_by(YearMonth) %>%
mutate(LastDate = last(date)) %>%
group_by(LastDate, name, YearMonth) %>%
summarise(
Monthly_Return = last(close) / first(close) - 1,
hilo = max(high) - min(low),
N_Cryptos = n())
print(Crypto_Adj)
## # A tibble: 1,631 × 6
## # Groups: LastDate, name [1,631]
## LastDate name YearMonth Monthly_Return hilo N_Cryptos
## <date> <chr> <chr> <dbl> <dbl> <int>
## 1 2013-04-30 Bitcoin Apr13 0.0357 15.4 3
## 2 2013-04-30 Litecoin Apr13 -0.0115 0.400 3
## 3 2013-05-31 Bitcoin May13 0.103 60.8 31
## 4 2013-05-31 Litecoin May13 -0.221 1.96 31
## 5 2013-06-30 Bitcoin Jun13 -0.253 41.3 30
## 6 2013-06-30 Litecoin Jun13 0.0530 1.56 30
## 7 2013-07-31 Bitcoin Jul13 0.205 45.8 31
## 8 2013-07-31 Litecoin Jul13 -0.0271 0.97 31
## 9 2013-08-31 Bitcoin Aug13 0.301 39.7 31
## 10 2013-08-31 Litecoin Aug13 -0.113 0.63 31
## # … with 1,621 more rows
Crypto_Adj %>%
group_by(YearMonth) %>%
# Note: rank() ranks ascending (Rank 1 = lowest return); use rank(-Monthly_Return) if you want the highest returns ranked first
mutate(Rank = rank(Monthly_Return) ) %>%
group_by(name) %>%
summarise( Pct_Top_Three = sum( ifelse( LastDate >= lubridate::ymd(20140101) &
Rank %in% c(1, 2, 3), 1, 0 ) ) / n()) %>%
top_n(5, Pct_Top_Three) %>%
arrange(desc(Pct_Top_Three))
## # A tibble: 5 × 2
## name Pct_Top_Three
## <chr> <dbl>
## 1 Algorand 0.375
## 2 MINDOL 0.364
## 3 ThoreNext 0.333
## 4 Ravencoin 0.261
## 5 Bitcoin Gold 0.25
Now, see if you can calculate the coin with the highest return since 2018 for coins that listed before March 2016. Go!
Provided with the dataframe fmxdat::Fin_Data_SA_US_NKY, let’s calculate each stock’s market-cap weight per date - both within its Universe and within its sector - and then find the stock with the best Sharpe ratio per sector:
fmxdat::Fin_Data_SA_US_NKY %>%
group_by(date, Universe) %>%
mutate(Market_Weight = Market.Cap / sum(Market.Cap)) %>%
group_by(date, Universe, BICS_LEVEL_1_SECTOR_NAME) %>%
mutate(Index_Weight = Market.Cap / sum(Market.Cap)) %>% arrange(date)
## # A tibble: 35,704 × 10
## # Groups: date, Universe, BICS_LEVEL_1_SECTOR_NAME [7,641]
## date Tickers TRI Short…¹ Marke…² BICS_…³ BICS_…⁴ Unive…⁵ Marke…⁶
## <date> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 2017-01-02 ACT SJ Equ… 1.17 AFROCE… 2.83e 8 Health… Health… JALSHA… 0.00132
## 2 2017-01-02 ANG SJ Equ… 16.4 ANGLOG… 5.15e 9 Materi… Metals… JALSHA… 0.0241
## 3 2017-01-02 AWA SJ Equ… 0.967 ARROWH… 6.85e 8 Financ… Real E… JALSHA… 0.00321
## 4 2017-01-02 BID SJ Equ… 18.0 BID CO… 5.77e 9 Consum… Commer… JALSHA… 0.0270
## 5 2017-01-02 BIL SJ Equ… 28.8 BHP BI… 1.03e11 Materi… Iron &… JALSHA… 0.481
## 6 2017-01-02 CLH SJ Equ… 28.9 CITY L… 4.76e 8 Consum… Gaming… JALSHA… 0.00223
## 7 2017-01-02 CLS SJ Equ… 14.9 CLICKS… 2.22e 9 Consum… Retail… JALSHA… 0.0104
## 8 2017-01-02 COH SJ Equ… 3.20 CURRO … 1.44e 9 Consum… Consum… JALSHA… 0.00674
## 9 2017-01-02 CRP SJ Equ… 0.690 CAP & … 5.06e 8 Financ… Real E… JALSHA… 0.00237
## 10 2017-01-02 EPP SJ Equ… 1.42 ECHO P… 8.48e 8 Financ… Real E… JALSHA… 0.00397
## # … with 35,694 more rows, 1 more variable: Index_Weight <dbl>, and abbreviated
## # variable names ¹Short.Name, ²Market.Cap, ³BICS_LEVEL_1_SECTOR_NAME,
## # ⁴BICS_LEVEL_2_INDUSTRY_GROUP_NAME, ⁵Universe, ⁶Market_Weight
fmxdat::Fin_Data_SA_US_NKY %>%
arrange(date) %>%
group_by(Tickers) %>%
mutate(Return = TRI / lag(TRI) - 1) %>% filter(date > first(date)) %>%
filter(date >= lubridate::ymd(20170701)) %>%
group_by(Universe, BICS_LEVEL_1_SECTOR_NAME, Tickers) %>%
summarise(Sharpe = mean(Return, na.rm = T) / sd(Return, na.rm = T)) %>%
group_by(Universe, BICS_LEVEL_1_SECTOR_NAME) %>%
filter(Sharpe == max(Sharpe, na.rm=T)) %>% ungroup()
## # A tibble: 27 × 4
## Universe BICS_LEVEL_1_SECTOR_NAME Tickers Sharpe
## <chr> <chr> <chr> <dbl>
## 1 JALSHAll Communications VOD SJ Equity 0.0483
## 2 JALSHAll Consumer Discretionary CLH SJ Equity 0.0590
## 3 JALSHAll Consumer Staples CLS SJ Equity 0.110
## 4 JALSHAll Energy EXX SJ Equity 0.162
## 5 JALSHAll Financials TTO SJ Equity 0.116
## 6 JALSHAll Health Care ACT SJ Equity 0.0254
## 7 JALSHAll Industrials GRF SJ Equity -0.0566
## 8 JALSHAll Materials KIO SJ Equity 0.221
## 9 NKY Communications 9433 JP Equity -0.0152
## 10 NKY Consumer Discretionary 7269 JP Equity 0.0940
## # … with 17 more rows
Now, let’s have some more fun while we exercise our wrangling skills.
Here is the entire transcript of the series Friends:
With this, we now have access to the lines in the series:
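The datasets below come from the friends package (an assumption based on the plot captions later on, which credit the Friends package in R):

pacman::p_load(friends)
friends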
## # A tibble: 67,373 × 6
## text speaker season episode scene utter…¹
## <chr> <chr> <int> <int> <int> <int>
## 1 There's nothing to tell! He's just some… Monica… 1 1 1 1
## 2 C'mon, you're going out with the guy! T… Joey T… 1 1 1 2
## 3 All right Joey, be nice. So does he hav… Chandl… 1 1 1 3
## 4 Wait, does he eat chalk? Phoebe… 1 1 1 4
## 5 (They all stare, bemused.) Scene … 1 1 1 5
## 6 Just, 'cause, I don't want her to go th… Phoebe… 1 1 1 6
## 7 Okay, everybody relax. This is not even… Monica… 1 1 1 7
## 8 Sounds like a date to me. Chandl… 1 1 1 8
## 9 [Time Lapse] Scene … 1 1 1 9
## 10 Alright, so I'm back in high school, I'… Chandl… 1 1 1 10
## # … with 67,363 more rows, and abbreviated variable name ¹utterance
Some more detail is also available - e.g. the emotion attached to each utterance:
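This is the friends_emotions dataset from the same package:

friends_emotions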
## # A tibble: 12,606 × 5
## season episode scene utterance emotion
## <int> <int> <int> <int> <chr>
## 1 1 1 4 1 Mad
## 2 1 1 4 3 Neutral
## 3 1 1 4 4 Joyful
## 4 1 1 4 5 Neutral
## 5 1 1 4 6 Neutral
## 6 1 1 4 7 Neutral
## 7 1 1 4 8 Scared
## 8 1 1 4 10 Joyful
## 9 1 1 4 11 Joyful
## 10 1 1 4 12 Sad
## # … with 12,596 more rows
As well as ratings information for the series:
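These sit in friends_info, used in the plots below:

friends_info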
## # A tibble: 236 × 8
## season episode title direc…¹ writt…² air_date us_vi…³ imdb_…⁴
## <int> <int> <chr> <chr> <chr> <date> <dbl> <dbl>
## 1 1 1 The Pilot James … David … 1994-09-22 21.5 8.3
## 2 1 2 The One with the S… James … David … 1994-09-29 20.2 8.1
## 3 1 3 The One with the T… James … Jeffre… 1994-10-06 19.5 8.2
## 4 1 4 The One with Georg… James … Alexa … 1994-10-13 19.7 8.1
## 5 1 5 The One with the E… Pamela… Jeff G… 1994-10-20 18.6 8.5
## 6 1 6 The One with the B… Arlene… Adam C… 1994-10-27 18.2 8.1
## 7 1 7 The One with the B… James … Jeffre… 1994-11-03 23.5 9
## 8 1 8 The One Where Nana… James … Marta … 1994-11-10 21.1 8.1
## 9 1 9 The One Where Unde… James … Jeff G… 1994-11-17 23.1 8.2
## 10 1 10 The One with the M… Peter … Adam C… 1994-12-15 19.9 8.1
## # … with 226 more rows, and abbreviated variable names ¹directed_by,
## # ²written_by, ³us_views_millions, ⁴imdb_rating
Play around with this, and learn more about the series Friends while practicing your newfound skill set.
In the next session, we will expand the tidyverse into tidy plotting using ggplot.
To give you a taste for this, see the plot for the viewership and ratings over time:
friends_info %>%
group_by(season) %>% summarise(mean_rat = mean(imdb_rating)) %>%
ggplot() + geom_line(aes(season, mean_rat), color = "steelblue", size = 1.2, alpha = 0.8) +
scale_x_continuous(breaks = c(1:length(unique(friends_info$season)))) +
theme_bw() +
labs(x = "Season", y = "Avg Rating", caption = "Calculated using the Friends package in R",
title = "Friends Avg. Ratings Per Season")
# US Viewership (millions):
df_plot <-
friends_info %>%
mutate(Text = ifelse(imdb_rating == min(imdb_rating, na.rm=T), glue::glue("Lowest: S{season}, ep{episode} ({imdb_rating})\n{title}"),
ifelse(imdb_rating == max(imdb_rating, na.rm=T), glue::glue("Highest: S{season}, ep{episode} ({imdb_rating})\n{title}"),
NA_character_))) %>% mutate(season = as.character(season))
# For ordering (otherwise it goes 1;10;2;3;...):
df_plot$season <- factor(df_plot$season, levels = as.character(1:10))
df_plot %>%
ggplot() +
geom_boxplot(aes(x = season, y = imdb_rating, fill = season), alpha = 0.4) +
geom_jitter(aes(season, imdb_rating, color = season), size = 3, alpha = 0.8) +
scale_y_continuous(limits = c(7, 10), breaks = scales::breaks_pretty(n = 6)) +
ggrepel::geom_text_repel(aes(x = season, y = imdb_rating, label = Text),
force = T ) +
theme_bw() +
labs(x = "Season", y = "IMDB Ratings", caption = "Calculated using the Friends package in R",
title = "Friends Ratings Per Episode") +
scale_fill_hue(l=40, c=35) +
scale_color_hue(l=40, c=35) +
guides(fill = F, color = F)
# Viewership:
df_plot <-
friends_info %>%
mutate(Text = ifelse(us_views_millions == min(us_views_millions, na.rm=T), glue::glue("Worst: S{season}, ep{episode} ({us_views_millions}mn)\n{title}"),
ifelse(us_views_millions == max(us_views_millions, na.rm=T), glue::glue("Best: S{season}, ep{episode} ({us_views_millions}mn)\n{title}"),
NA_character_))) %>% mutate(season = as.character(season))
# For ordering (otherwise it goes 1;10;2;3;...):
df_plot$season <- factor(df_plot$season, levels = as.character(1:10))
df_plot %>%
ggplot() +
geom_boxplot(aes(x = season, y = us_views_millions, fill = season), alpha = 0.4) +
geom_jitter(aes(season, us_views_millions, color = season), size = 3, alpha = 0.8) +
scale_y_continuous(limits = c(15, 60), breaks = scales::breaks_pretty(n = 10)) +
ggrepel::geom_text_repel(aes(x = season, y = us_views_millions, label = Text),
force = T ) +
theme_bw() +
labs(x = "Season", y = "Views (mn)", caption = "Calculated using the Friends package in R",
title = "Friends US Views Per Episode") +
scale_fill_hue(l=40, c=35) +
scale_color_hue(l=40, c=35) +
guides(fill = F, color = F)
Typical American viewership - the one with the Superbowl…
Hope you enjoyed this session.