The aim of this tutorial is to introduce you to the tidyverse.
It is effectively a reimagining of data analytics that requires your data to be in a tidy format.
There are quite a few references offering more background, which I recommend consulting for further reading.
Once your data is tidy, analyzing and working with it is simpler, more stable and more efficient - and you will never need to type out clumsy for-loops again (at least in the traditional sense of the word).
Tidy data means each variable is in its own column, with each observation being in its own row.
We’ll get to what this means in examples later on.
Let’s first discuss the tidyverse:
The tidyverse package is a collection of various other packages, including:
dplyr : awesome for data wrangling
readr : awesome for importing data
ggplot2 : the standard for plotting in a tidy framework
tidyr : essential for manipulating dataframes
purrr : makes your code (and you) purrr with delight… you’ll see. Absolute syntactic sugar. Pure elation.
A key part of using the tidyverse is the pipe operator ‘%>%’.
x %>% f is equivalent to f(x)
x %>% f(y) is equivalent to f(x,y)
We can use a dot to refer to the data input: x %>% f(.) is equivalent to f(x)
Press CTRL+SHIFT+M in RStudio, and you will see the pipe appear. This will be your friend from now on.
‘%>%’ should be thought of and ‘read’ as ‘then’.
Thus, X %>% f(…) %>% g(…) should be read as: take X, then apply f(…), then apply g(…)
This makes your code much more readable (going from left to right) and easier to manage.
Let’s see it in action quickly using the ‘select’ command:
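Something along these lines could have produced the output below; ‘dailydata’ is a hypothetical name for the wide price dataframe loaded for this tutorial:

library(tidyverse)
# Keep only the Date column and the Brazil and Chile series (a sketch, assuming
# the wide dataframe is called 'dailydata'):
dailydata %>% select(Date, brz, chl)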
## # A tibble: 790 × 3
## Date brz chl
## <date> <dbl> <dbl>
## 1 2000-01-14 1420. NA
## 2 2000-01-21 1382. NA
## 3 2000-01-28 1344. NA
## 4 2000-02-04 1426. 1522.
## 5 2000-02-11 1416. 1544.
## 6 2000-02-18 1372. 1443.
## 7 2000-02-25 1385. 1380.
## 8 2000-03-03 1439. 1456.
## 9 2000-03-10 1424. 1421.
## 10 2000-03-17 1383. 1419.
## # ℹ 780 more rows
The cornerstone of the tidyverse is to make your data, well, tidy.
This means:
It is time to reboot how you think about dataframes.
dplyr requires you to think in a tidy way about your data - in particular, that your data should be in a long format, as opposed to wide.
The thinking behind dplyr is to organize your data in such a way that it makes analysis simpler (much simpler).
The idea behind tidy data can be described as follows:
If we think about our earlier dataframe - making it tidy implies:
Look at the figure above - this now allows us to easily apply the same functions to each Country sequentially.
Examples later will illustrate the ease with which this can be done.
Naturally this requires us to decide what to call the gathered key column, what to call the values so gathered, and which columns to gather together.
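A sketch of what that gather call might look like (again assuming the wide dataframe is called ‘dailydata’); the key column is named Country and the value column val:

tidydat <- 
  dailydata %>% 
  # Gather every column except Date into Country/val pairs:
  gather(key = "Country", value = "val", -Date)

tidydat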
## # A tibble: 16,590 × 3
## Date Country val
## <date> <chr> <dbl>
## 1 2000-01-14 brz 1420.
## 2 2000-01-14 chl NA
## 3 2000-01-14 chn 36.7
## 4 2000-01-14 col 105.
## 5 2000-01-14 cze 91.7
## 6 2000-01-14 egt 352.
## 7 2000-01-14 grc 1215.
## 8 2000-01-14 hun 369.
## 9 2000-01-14 ind 175.
## 10 2000-01-14 ino 280.
## # ℹ 16,580 more rows
Now why would this be a preferable format?
Consider now the ease with which we could, e.g., calculate the returns for each country.
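A minimal sketch of how the Returns column in the output below might be added (the exact return definition is an assumption):

tidydat %>% 
  arrange(Date) %>% 
  group_by(Country) %>% 
  # Simple period-on-period returns per country:
  mutate(Returns = val / lag(val) - 1) %>% 
  ungroup()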
## # A tibble: 16,590 × 4
## Date Country val Returns
## <date> <chr> <dbl> <dbl>
## 1 2000-01-14 brz 1420. NA
## 2 2000-01-14 chl NA NA
## 3 2000-01-14 chn 36.7 NA
## 4 2000-01-14 col 105. NA
## 5 2000-01-14 cze 91.7 NA
## 6 2000-01-14 egt 352. NA
## 7 2000-01-14 grc 1215. NA
## 8 2000-01-14 hun 369. NA
## 9 2000-01-14 ind 175. NA
## 10 2000-01-14 ino 280. NA
## # ℹ 16,580 more rows
Notice my use of the verb ‘mutate’ above.
Mutate adds a column by applying the function given. The idea can be summarised as:
df %>% mutate(New_Column_Name = f(Existing_Column))
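For example, standardizing the returns per country is just another mutate away (a sketch; the standardization used for the output below is an assumption):

tidydat %>% 
  arrange(Date) %>% 
  group_by(Country) %>% 
  mutate(Returns = val / lag(val) - 1) %>% 
  # Standardize each country's returns by its own mean and standard deviation:
  mutate(Std_Returns = (Returns - mean(Returns, na.rm = TRUE)) / sd(Returns, na.rm = TRUE)) %>% 
  ungroup()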
## # A tibble: 16,590 × 5
## Date Country val Returns Std_Returns
## <date> <chr> <dbl> <dbl> <dbl>
## 1 2000-01-14 brz 1420. NA NA
## 2 2000-01-14 chl NA NA NA
## 3 2000-01-14 chn 36.7 NA NA
## 4 2000-01-14 col 105. NA NA
## 5 2000-01-14 cze 91.7 NA NA
## 6 2000-01-14 egt 352. NA NA
## 7 2000-01-14 grc 1215. NA NA
## 8 2000-01-14 hun 369. NA NA
## 9 2000-01-14 ind 175. NA NA
## 10 2000-01-14 ino 280. NA NA
## # ℹ 16,580 more rows
‘filtering’ is another useful verb. It allows easy subsetting of your data. E.g. let’s ‘filter’ out the first date of our dataframe above (which has NA returns):
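A sketch of how this might look, continuing from the returns calculation above:

tidydat %>% 
  arrange(Date) %>% 
  group_by(Country) %>% 
  mutate(Returns = val / lag(val) - 1) %>% 
  ungroup() %>% 
  # Drop the first date (all its returns are NA):
  filter(Date > first(Date))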
## # A tibble: 16,569 × 4
## Date Country val Returns
## <date> <chr> <dbl> <dbl>
## 1 2000-01-21 brz 1382. -0.0267
## 2 2000-01-21 chl NA NA
## 3 2000-01-21 chn 36.1 -0.0172
## 4 2000-01-21 col 102. -0.0281
## 5 2000-01-21 cze 98.3 0.0722
## 6 2000-01-21 egt 375. 0.0656
## 7 2000-01-21 grc 1177. -0.0320
## 8 2000-01-21 hun 379. 0.0271
## 9 2000-01-21 ind 179. 0.0232
## 10 2000-01-21 ino 261. -0.0670
## # ℹ 16,559 more rows
Filter can also be used nicely with vector arguments…
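For example, keeping only Brazil and South Africa (a sketch matching the output below):

tidydat %>% filter(Country %in% c("brz", "zar"))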
## # A tibble: 1,580 × 3
## Date Country val
## <date> <chr> <dbl>
## 1 2000-01-14 brz 1420.
## 2 2000-01-14 zar 257.
## 3 2000-01-21 brz 1382.
## 4 2000-01-21 zar 254.
## 5 2000-01-28 brz 1344.
## 6 2000-01-28 zar 229.
## 7 2000-02-04 brz 1426.
## 8 2000-02-04 zar 238.
## 9 2000-02-11 brz 1416.
## 10 2000-02-11 zar 239.
## # ℹ 1,570 more rows
… grepl (e.g. countries with a z or a b in their names)…
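A sketch using grepl inside filter:

# Keep countries whose code contains a "b" or a "z":
tidydat %>% filter(grepl("b|z", Country))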
## # A tibble: 2,370 × 3
## Date Country val
## <date> <chr> <dbl>
## 1 2000-01-14 brz 1420.
## 2 2000-01-14 cze 91.7
## 3 2000-01-14 zar 257.
## 4 2000-01-21 brz 1382.
## 5 2000-01-21 cze 98.3
## 6 2000-01-21 zar 254.
## 7 2000-01-28 brz 1344.
## 8 2000-01-28 cze 97.1
## 9 2000-01-28 zar 229.
## 10 2000-02-04 brz 1426.
## # ℹ 2,360 more rows
… dates (notice lubridate package)
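Two sketches that could match the outputs below (the exact cut-off dates are assumptions):

library(lubridate)
# Keep only dates from 2001 onwards:
tidydat %>% filter(Date >= ymd(20010101))

# Keep only dates from 2015 onwards:
tidydat %>% filter(Date >= ymd(20150101))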
## # A tibble: 15,519 × 3
## Date Country val
## <date> <chr> <dbl>
## 1 2001-01-05 brz 1286.
## 2 2001-01-05 chl 1202.
## 3 2001-01-05 chn 26.8
## 4 2001-01-05 col 55.8
## 5 2001-01-05 cze 91.0
## 6 2001-01-05 egt 187.
## 7 2001-01-05 grc 700.
## 8 2001-01-05 hun 257.
## 9 2001-01-05 ind 136.
## 10 2001-01-05 ino 111.
## # ℹ 15,509 more rows
## # A tibble: 189 × 3
## Date Country val
## <date> <chr> <dbl>
## 1 2015-01-02 brz 4980.
## 2 2015-01-02 chl 4381.
## 3 2015-01-02 chn 115.
## 4 2015-01-02 col 1745.
## 5 2015-01-02 cze 749.
## 6 2015-01-02 egt 1921.
## 7 2015-01-02 grc 171.
## 8 2015-01-02 hun 478.
## 9 2015-01-02 ind 707.
## 10 2015-01-02 ino 1727.
## # ℹ 179 more rows
… values
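A sketch of filtering on values (the threshold is an assumption):

# Keep only observations with an index value above 1000:
tidydat %>% filter(val > 1000)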
## # A tibble: 5,512 × 3
## Date Country val
## <date> <chr> <dbl>
## 1 2000-01-14 brz 1420.
## 2 2000-01-14 grc 1215.
## 3 2000-01-14 mex 2377.
## 4 2000-01-14 tur 1048.
## 5 2000-01-21 brz 1382.
## 6 2000-01-21 grc 1177.
## 7 2000-01-21 mex 2216.
## 8 2000-01-28 brz 1344.
## 9 2000-01-28 grc 1125.
## 10 2000-01-28 mex 2106.
## # ℹ 5,502 more rows
How. Easy. Is. That.
summarise is similar to mutate - but with the added feature of collapsing your data according to the function provided.
Best illustrated with example:
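A sketch using the tidy dataframe from before - mutate in the first chunk, summarise in the second:

# Mutate: adds each country's maximum value as a repeated column:
tidydat %>% group_by(Country) %>% mutate(Max = max(val, na.rm = TRUE))

# Summarise: collapses the data to one row per country:
tidydat %>% group_by(Country) %>% summarise(Max = max(val, na.rm = TRUE))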
## # A tibble: 16,590 × 4
## # Groups: Country [21]
## Date Country val Max
## <date> <chr> <dbl> <dbl>
## 1 2000-01-14 brz 1420. 10340.
## 2 2000-01-14 chl NA 7483.
## 3 2000-01-14 chn 36.7 140.
## 4 2000-01-14 col 105. 2881.
## 5 2000-01-14 cze 91.7 1346.
## 6 2000-01-14 egt 352. 2675.
## 7 2000-01-14 grc 1215. 1975.
## 8 2000-01-14 hun 369. 1595.
## 9 2000-01-14 ind 175. 885.
## 10 2000-01-14 ino 280. 2072.
## # ℹ 16,580 more rows
## # A tibble: 21 × 2
## Country Max
## <chr> <dbl>
## 1 brz 10340.
## 2 chl 7483.
## 3 chn 140.
## 4 col 2881.
## 5 cze 1346.
## 6 egt 2675.
## 7 grc 1975.
## 8 hun 1595.
## 9 ind 885.
## 10 ino 2072.
## # ℹ 11 more rows
Notice in the second chunk, summarise only returns the max values by the groups asked. Mutate adds a column, repeating the max values at each date.
Joining dataframes is key to good analytics.
Tidy joining is typically done using left_join or full_join by some common column.
Let’s work out an example to show this.
Suppose your tidydat was stored as two excel files with the following formats:
## # A tibble: 790 × 22
## Date brz chl chn col cze egt grc hun ind ino kor
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2000-01-14 1420. NA 36.7 105. 91.7 352. 1215. 369. 175. 280. 192.
## 2 2000-01-21 1382. NA 36.1 102. 98.3 375. 1177. 379. 179. 261. 186.
## 3 2000-01-28 1344. NA 36.0 104. 97.1 354. 1125. 379. 187. 246. 191.
## 4 2000-02-04 1426. 1522. 35.5 105. 104. 349. 1202. 370. 198. 239. 189.
## 5 2000-02-11 1416. 1544. 31.8 104. 114. 359. 1257. 396. 220. 249. 188.
## 6 2000-02-18 1372. 1443. 33.2 97.7 114. 350. 1209. 402. 243. 231. 172.
## 7 2000-02-25 1385. 1380. 29.7 90.3 112. 322. 1195. 405. 222. 219. 168.
## 8 2000-03-03 1439. 1456. 28.5 88.2 117. 327. 1164. 401. 217. 208. 177.
## 9 2000-03-10 1424. 1421. 29.8 88.4 117. 327. 1111. 425. 211. 234. 174.
## 10 2000-03-17 1383. 1419. 29.9 87.9 118. 328. 1099. 399. 189. 230. 169.
## # ℹ 780 more rows
## # ℹ 10 more variables: mal <dbl>, mex <dbl>, per <dbl>, phi <dbl>, pol <dbl>,
## # rus <dbl>, zar <dbl>, twn <dbl>, tha <dbl>, tur <dbl>
## # A tibble: 9 × 2
## Country Label
## <chr> <chr>
## 1 brz Brics
## 2 chn Brics
## 3 zar Brics
## 4 ind Brics
## 5 rus Brics
## 6 mex Mint
## 7 tur Mint
## 8 hun HEGP
## 9 egt HEGP
Notice that combining this information - to do, e.g., calculations by BRICS group - would otherwise have been onerous.
Now it is simple using ‘left join’ - to join the Label from the second dataframe to the first:
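A sketch, assuming the wide file above was read in as ‘Indexes’ (a hypothetical name) and the label file as ‘Indexes_Labs’:

Indexes %>% 
  # First make the price data tidy, then join the labels on by Country:
  gather(key = "Country", value = "val", -Date) %>% 
  left_join(Indexes_Labs, by = "Country")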
Notice that ‘joining’ allocates a Label to each Country, filling in NA where the right dataframe has no entry (e.g. chl, col and cze had no label, so they get NA labels).
In the above example full_join and left_join produce equivalent results. Of course, if the ‘right’ dataframe (Indexes_Labs) contained other information too (like its own dates), the choice would matter: left_join keeps only the rows that match the left dataframe (discarding, e.g., unmatched dates in the right dataframe), while full_join preserves all rows from both.
Let’s now use these verbs on a wide dataframe to do some nice things.
First, our dataset is wide (which should start bothering you):
## # A tibble: 12 × 15
## date SubFactors `AAA SJ Equity` `BBB SJ Equity` `CCC SJ Equity`
## <date> <chr> <dbl> <dbl> <dbl>
## 1 2002-01-31 EBIT.EV 15.8 7.78 14.8
## 2 2002-01-31 FCF.EV 10.5 6.78 12.3
## 3 2002-01-31 ROA 13.4 15.7 12.6
## 4 2002-01-31 ROE 3.22 15.9 11.9
## 5 2002-01-31 TRR6M1M 13.4 14.4 9.76
## 6 2002-01-31 Volat.D.60 11.4 9.82 12.5
## 7 2002-02-28 EBIT.EV 5.83 10.7 10.3
## 8 2002-02-28 FCF.EV 11.6 9.17 10.5
## 9 2002-02-28 ROA 4.32 11.5 10.0
## 10 2002-02-28 ROE 15.0 13.6 13.0
## 11 2002-02-28 TRR6M1M 13.7 9.32 10.6
## 12 2002-02-28 Volat.D.60 6.89 7.87 13.1
## # ℹ 10 more variables: `DDD SJ Equity` <dbl>, `EEE SJ Equity` <dbl>,
## # `FFF SJ Equity` <dbl>, `TTT SPGLOB Equity` <dbl>,
## # `UUU SPGLOB Equity` <dbl>, `VVV SPGLOB Equity` <dbl>,
## # `XXX SPGLOB Equity` <dbl>, `YYY SPGLOB Equity` <dbl>,
## # `ZZZ SPGLOB Equity` <dbl>, AnotherColumn <chr>
Before we gather - let’s plan.
First, our columns suggest two types of stocks - some are from SA (ending with ’ SJ Equity’), while others are US (ending with ’ SPGLOB Equity’).
Second, the values correspond to the SubFactors column.
Let’s gather these by Stocks generally first, after which we create a column where the countries are explicitly labelled.
To do so, we need to exploit the pattern in the column names. Here we can use the tidyselect helpers contains, starts_with, ends_with, one_of, etc.
Let’s use ‘contains’ with ‘Equity’ - which will suffice, as it gathers all the stock columns (and ignores ‘date’, ‘AnotherColumn’ and ‘SubFactors’):
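A sketch of that gather call (assuming the wide dataframe above is called ‘datwide’, a hypothetical name):

Gathered <- 
  datwide %>% 
  # Gather only the stock columns - i.e. those containing 'Equity' in their names:
  gather(key = "Stocks", value = "Value", contains("Equity"))

Gathered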
## # A tibble: 144 × 5
## date SubFactors AnotherColumn Stocks Value
## <date> <chr> <chr> <chr> <dbl>
## 1 2002-01-31 EBIT.EV Random AAA SJ Equity 15.8
## 2 2002-01-31 FCF.EV Random AAA SJ Equity 10.5
## 3 2002-01-31 ROA Random AAA SJ Equity 13.4
## 4 2002-01-31 ROE Random AAA SJ Equity 3.22
## 5 2002-01-31 TRR6M1M Random AAA SJ Equity 13.4
## 6 2002-01-31 Volat.D.60 Random AAA SJ Equity 11.4
## 7 2002-02-28 EBIT.EV Random AAA SJ Equity 5.83
## 8 2002-02-28 FCF.EV Random AAA SJ Equity 11.6
## 9 2002-02-28 ROA Random AAA SJ Equity 4.32
## 10 2002-02-28 ROE Random AAA SJ Equity 15.0
## # ℹ 134 more rows
Awesome. Let’s now create a column labelling the country based on SJ and SPGLOB using grepl, after which we gsub out the ’ SJ Equity’ and ’ SPGLOB Equity’ parts (see how nice these now play along with mutate…..):
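A sketch of the grepl and gsub steps, continuing from the gathered dataframe above:

Gathered %>% 
  # Label the country by the suffix of the stock name...
  mutate(Country = ifelse(grepl(" SJ Equity", Stocks), "SA",
                          ifelse(grepl(" SPGLOB Equity", Stocks), "US", NA_character_))) %>% 
  # ...and then strip the suffix from the stock names:
  mutate(Stocks = gsub(" SJ Equity| SPGLOB Equity", "", Stocks))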
## # A tibble: 144 × 6
## date SubFactors AnotherColumn Stocks Value Country
## <date> <chr> <chr> <chr> <dbl> <chr>
## 1 2002-01-31 EBIT.EV Random AAA 15.8 SA
## 2 2002-01-31 FCF.EV Random AAA 10.5 SA
## 3 2002-01-31 ROA Random AAA 13.4 SA
## 4 2002-01-31 ROE Random AAA 3.22 SA
## 5 2002-01-31 TRR6M1M Random AAA 13.4 SA
## 6 2002-01-31 Volat.D.60 Random AAA 11.4 SA
## 7 2002-02-28 EBIT.EV Random AAA 5.83 SA
## 8 2002-02-28 FCF.EV Random AAA 11.6 SA
## 9 2002-02-28 ROA Random AAA 4.32 SA
## 10 2002-02-28 ROE Random AAA 15.0 SA
## # ℹ 134 more rows
This takes a bit of time getting used to - so practice this a bit.
If you want to go back to a wide format, spread is the opposite of gather.
E.g. let’s spread all the ROA and ROE values for SA stocks (notice my use of %in% below….):
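A sketch of that spread call, continuing from the labelled dataframe above (here called ‘Gathered_Adj’, a hypothetical name):

Gathered_Adj %>% 
  # Keep only ROA and ROE for the SA stocks...
  filter(Country == "SA", SubFactors %in% c("ROA", "ROE")) %>% 
  # ...and spread SubFactors back out into their own columns:
  spread(key = SubFactors, value = Value)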
## # A tibble: 12 × 6
## date AnotherColumn Stocks Country ROA ROE
## <date> <chr> <chr> <chr> <dbl> <dbl>
## 1 2002-01-31 Random AAA SA 13.4 3.22
## 2 2002-01-31 Random BBB SA 15.7 15.9
## 3 2002-01-31 Random CCC SA 12.6 11.9
## 4 2002-01-31 Random DDD SA 9.71 10.9
## 5 2002-01-31 Random EEE SA 10.5 11.9
## 6 2002-01-31 Random FFF SA 8.55 14.3
## 7 2002-02-28 Random AAA SA 4.32 15.0
## 8 2002-02-28 Random BBB SA 11.5 13.6
## 9 2002-02-28 Random CCC SA 10.0 13.0
## 10 2002-02-28 Random DDD SA 13.0 11.8
## 11 2002-02-28 Random EEE SA 10.2 20.0
## 12 2002-02-28 Random FFF SA 9.52 1.98
There are many, many combinations of these core verbs that allow efficient data analytics - sometimes with the help of some auxiliary verbs.
E.g., let’s use this dplyr native dataframe: ‘dplyr::storms’ to do some nice wrangling from which you can learn a few tricks… (I will add the tricks in brackets before using it)
## # A tibble: 19,537 × 13
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropical d… NA 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropical d… NA 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropical d… NA 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropical d… NA 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropical d… NA 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropical d… NA 25 1012
## 7 Amy 1975 6 28 12 33.3 -78 tropical d… NA 25 1011
## 8 Amy 1975 6 28 18 34 -77 tropical d… NA 30 1006
## 9 Amy 1975 6 29 0 34.4 -75.8 tropical s… NA 35 1004
## 10 Amy 1975 6 29 6 34 -74.8 tropical s… NA 40 1002
## # ℹ 19,527 more rows
## # ℹ 2 more variables: tropicalstorm_force_diameter <int>,
## # hurricane_force_diameter <int>
Let’s see which statuses are available (status, unique):
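A sketch:

dplyr::storms %>% pull(status) %>% unique()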
## [1] tropical depression tropical storm extratropical
## [4] hurricane subtropical storm subtropical depression
## [7] disturbance other low tropical wave
## 9 Levels: disturbance extratropical hurricane ... tropical wave
Let’s replace NA values in hurricane_force_diameter with something elaborate: the ‘lat’ column’s minimum values times the ‘long’ column’s max values (coalesce):
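A sketch of how coalesce could be used for this, following the description above:

dplyr::storms %>% 
  # Replace NAs in hurricane_force_diameter with min(lat) * max(long):
  mutate(hurricane_force_diameter = coalesce(as.double(hurricane_force_diameter),
                                             min(lat, na.rm = TRUE) * max(long, na.rm = TRUE)))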
## # A tibble: 19,537 × 13
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropical d… NA 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropical d… NA 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropical d… NA 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropical d… NA 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropical d… NA 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropical d… NA 25 1012
## 7 Amy 1975 6 28 12 33.3 -78 tropical d… NA 25 1011
## 8 Amy 1975 6 28 18 34 -77 tropical d… NA 30 1006
## 9 Amy 1975 6 29 0 34.4 -75.8 tropical s… NA 35 1004
## 10 Amy 1975 6 29 6 34 -74.8 tropical s… NA 40 1002
## # ℹ 19,527 more rows
## # ℹ 2 more variables: tropicalstorm_force_diameter <int>,
## # hurricane_force_diameter <dbl>
Let’s apply the same function on several columns using across:
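Two sketches - the first (mutate + across) matches the first output below, the second (summarise + across) the single-row output that follows:

# Replace lat, long and pressure with their respective column maxima:
dplyr::storms %>% 
  mutate(across(c(lat, long, pressure), ~ max(., na.rm = TRUE)))

# Collapse several columns to their maxima in one summarise call:
dplyr::storms %>% 
  summarise(across(c(tropicalstorm_force_diameter, hurricane_force_diameter, pressure),
                   ~ max(., na.rm = TRUE)))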
## # A tibble: 19,537 × 13
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
## 1 Amy 1975 6 27 0 70.7 13.5 tropical d… NA 25 1024
## 2 Amy 1975 6 27 6 70.7 13.5 tropical d… NA 25 1024
## 3 Amy 1975 6 27 12 70.7 13.5 tropical d… NA 25 1024
## 4 Amy 1975 6 27 18 70.7 13.5 tropical d… NA 25 1024
## 5 Amy 1975 6 28 0 70.7 13.5 tropical d… NA 25 1024
## 6 Amy 1975 6 28 6 70.7 13.5 tropical d… NA 25 1024
## 7 Amy 1975 6 28 12 70.7 13.5 tropical d… NA 25 1024
## 8 Amy 1975 6 28 18 70.7 13.5 tropical d… NA 30 1024
## 9 Amy 1975 6 29 0 70.7 13.5 tropical s… NA 35 1024
## 10 Amy 1975 6 29 6 70.7 13.5 tropical s… NA 40 1024
## # ℹ 19,527 more rows
## # ℹ 2 more variables: tropicalstorm_force_diameter <int>,
## # hurricane_force_diameter <int>
## # A tibble: 1 × 3
## tropicalstorm_force_diameter hurricane_force_diameter pressure
## <int> <int> <int>
## 1 1440 300 1024
Let’s apply several functions to several columns - naming each result accordingly using a suffix:
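A sketch using a named list of functions inside across:

dplyr::storms %>% 
  mutate(across(c(lat, long, pressure, tropicalstorm_force_diameter),
                list(Max = ~ max(., na.rm = TRUE), Min = ~ min(., na.rm = TRUE)),
                .names = "{.col}_{.fn}"))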
## # A tibble: 19,537 × 21
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropical d… NA 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropical d… NA 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropical d… NA 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropical d… NA 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropical d… NA 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropical d… NA 25 1012
## 7 Amy 1975 6 28 12 33.3 -78 tropical d… NA 25 1011
## 8 Amy 1975 6 28 18 34 -77 tropical d… NA 30 1006
## 9 Amy 1975 6 29 0 34.4 -75.8 tropical s… NA 35 1004
## 10 Amy 1975 6 29 6 34 -74.8 tropical s… NA 40 1002
## # ℹ 19,527 more rows
## # ℹ 10 more variables: tropicalstorm_force_diameter <int>,
## # hurricane_force_diameter <int>, lat_Max <dbl>, lat_Min <dbl>,
## # long_Max <dbl>, long_Min <dbl>, pressure_Max <int>, pressure_Min <int>,
## # tropicalstorm_force_diameter_Max <int>,
## # tropicalstorm_force_diameter_Min <int>
Let’s create a column appended to the original dataframe that gives the maximum wind speed, storm name and its status, per year. Then calculate each observation’s wind speed as a percentage of the max speed reached that year.
left_join(
  dplyr::storms,
  # Per year: keep the observation(s) with the maximum wind speed...
  dplyr::storms %>% group_by(year) %>% filter(wind == max(wind)) %>%
    select(year, name, wind, status) %>% ungroup() %>%
    # ...store the max speed and a message with the status reached...
    mutate(MaxSpeed = wind, Max_Wind_Speed_Msg = paste0("max status reached: ", status)) %>%
    # ...then drop duplicate rows before joining back by year:
    select(-status, -wind, -name) %>% unique,
  by = c("year")
) %>%
  # Express each observation's wind as a percentage of that year's maximum:
  mutate(Pct_of_Max = paste0(round(wind / MaxSpeed, 2) * 100, "%"))
## # A tibble: 19,537 × 16
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropical d… NA 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropical d… NA 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropical d… NA 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropical d… NA 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropical d… NA 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropical d… NA 25 1012
## 7 Amy 1975 6 28 12 33.3 -78 tropical d… NA 25 1011
## 8 Amy 1975 6 28 18 34 -77 tropical d… NA 30 1006
## 9 Amy 1975 6 29 0 34.4 -75.8 tropical s… NA 35 1004
## 10 Amy 1975 6 29 6 34 -74.8 tropical s… NA 40 1002
## # ℹ 19,527 more rows
## # ℹ 5 more variables: tropicalstorm_force_diameter <int>,
## # hurricane_force_diameter <int>, MaxSpeed <int>, Max_Wind_Speed_Msg <chr>,
## # Pct_of_Max <chr>
How powerful are these tools, right? We will see more examples of what we can do later.
Nice for adjusting strings:
str_sub(): extract substrings from a character vector
str_trim(): trim whitespace
str_length(): check the length of a string
str_to_lower() / str_to_upper(): convert a string to lower or upper case
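A quick illustration on a made-up string:

library(stringr)
x <- "  Friends: The One with the Pipe  "
str_trim(x)                  # drop the leading and trailing spaces
str_sub(str_trim(x), 1, 7)   # "Friends"
str_length(str_trim(x))      # number of characters after trimming
str_to_lower(x)
str_to_upper(x)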
Let’s try out some more fun data wrangling examples…. hope you are getting excited!
Let’s calculate a few things, per month, for several crypto currencies.
Tip: we use format for dates here… format(date, “%b”) gives, e.g., the month as Feb. %B gives February, while %m gives 02.
Now let’s wrangle the cryptos…
Cryptos <- fmxdat::cryptos

Crypto_Adj <- 
  Cryptos %>% 
  # Label each observation with its year-month (e.g. Apr13):
  mutate(YearMonth = format(date, "%b%y")) %>% 
  # Find the last date within each month:
  group_by(YearMonth) %>% 
  mutate(LastDate = last(date)) %>% 
  # Per coin and month: monthly return, high-low range and number of observations:
  group_by(LastDate, name, YearMonth) %>% 
  summarise(
    Monthly_Return = last(close) / first(close) - 1,
    hilo = max(high) - min(low),
    N_Cryptos = n())

print(Crypto_Adj)
## # A tibble: 1,631 × 6
## # Groups: LastDate, name [1,631]
## LastDate name YearMonth Monthly_Return hilo N_Cryptos
## <date> <chr> <chr> <dbl> <dbl> <int>
## 1 2013-04-30 Bitcoin Apr13 0.0357 15.4 3
## 2 2013-04-30 Litecoin Apr13 -0.0115 0.400 3
## 3 2013-05-31 Bitcoin May13 0.103 60.8 31
## 4 2013-05-31 Litecoin May13 -0.221 1.96 31
## 5 2013-06-30 Bitcoin Jun13 -0.253 41.3 30
## 6 2013-06-30 Litecoin Jun13 0.0530 1.56 30
## 7 2013-07-31 Bitcoin Jul13 0.205 45.8 31
## 8 2013-07-31 Litecoin Jul13 -0.0271 0.97 31
## 9 2013-08-31 Bitcoin Aug13 0.301 39.7 31
## 10 2013-08-31 Litecoin Aug13 -0.113 0.63 31
## # ℹ 1,621 more rows
## # A tibble: 5 × 2
## name Pct_Top_Three
## <chr> <dbl>
## 1 Algorand 0.375
## 2 MINDOL 0.364
## 3 ThoreNext 0.333
## 4 Ravencoin 0.261
## 5 Bitcoin Gold 0.25
Now, see if you can calculate the coin with the highest return since 2018 for coins that listed before March 2016. Go!
Provided with the following dataframe, fmxdat::Fin_Data_SA_US_NKY, let’s find - per Universe and sector - the ticker with the highest Sharpe-type ratio since July 2017:
## # A tibble: 35,704 × 10
## # Groups: date, Universe, BICS_LEVEL_1_SECTOR_NAME [7,641]
## date Tickers TRI Short.Name Market.Cap BICS_LEVEL_1_SECTOR_…¹
## <date> <chr> <dbl> <chr> <dbl> <chr>
## 1 2017-01-02 ACT SJ Equity 1.17 AFROCENTRI… 2.83e 8 Health Care
## 2 2017-01-02 ANG SJ Equity 16.4 ANGLOGOLD … 5.15e 9 Materials
## 3 2017-01-02 AWA SJ Equity 0.967 ARROWHEAD … 6.85e 8 Financials
## 4 2017-01-02 BID SJ Equity 18.0 BID CORP L… 5.77e 9 Consumer Discretionary
## 5 2017-01-02 BIL SJ Equity 28.8 BHP BILLIT… 1.03e11 Materials
## 6 2017-01-02 CLH SJ Equity 28.9 CITY LODGE… 4.76e 8 Consumer Discretionary
## 7 2017-01-02 CLS SJ Equity 14.9 CLICKS GRO… 2.22e 9 Consumer Staples
## 8 2017-01-02 COH SJ Equity 3.20 CURRO HOLD… 1.44e 9 Consumer Discretionary
## 9 2017-01-02 CRP SJ Equity 0.690 CAP & REGI… 5.06e 8 Financials
## 10 2017-01-02 EPP SJ Equity 1.42 ECHO POLSK… 8.48e 8 Financials
## # ℹ 35,694 more rows
## # ℹ abbreviated name: ¹BICS_LEVEL_1_SECTOR_NAME
## # ℹ 4 more variables: BICS_LEVEL_2_INDUSTRY_GROUP_NAME <chr>, Universe <chr>,
## # Market_Weight <dbl>, Index_Weight <dbl>
fmxdat::Fin_Data_SA_US_NKY %>% 
  arrange(date) %>% 
  # Simple returns per ticker (dropping each ticker's first observation):
  group_by(Tickers) %>% 
  mutate(Return = TRI / lag(TRI) - 1) %>% filter(date > first(date)) %>% 
  filter(date >= lubridate::ymd(20170701)) %>% 
  # Sharpe-type ratio (mean over sd of returns) per ticker, within Universe and sector:
  group_by(Universe, BICS_LEVEL_1_SECTOR_NAME, Tickers) %>% 
  summarise(Sharpe = mean(Return, na.rm = T) / sd(Return, na.rm = T)) %>% 
  # Keep the ticker with the highest ratio in each Universe-sector pair:
  group_by(Universe, BICS_LEVEL_1_SECTOR_NAME) %>% 
  filter(Sharpe == max(Sharpe, na.rm = T)) %>% ungroup()
## # A tibble: 27 × 4
## Universe BICS_LEVEL_1_SECTOR_NAME Tickers Sharpe
## <chr> <chr> <chr> <dbl>
## 1 JALSHAll Communications VOD SJ Equity 0.0483
## 2 JALSHAll Consumer Discretionary CLH SJ Equity 0.0590
## 3 JALSHAll Consumer Staples CLS SJ Equity 0.110
## 4 JALSHAll Energy EXX SJ Equity 0.162
## 5 JALSHAll Financials TTO SJ Equity 0.116
## 6 JALSHAll Health Care ACT SJ Equity 0.0254
## 7 JALSHAll Industrials GRF SJ Equity -0.0566
## 8 JALSHAll Materials KIO SJ Equity 0.221
## 9 NKY Communications 9433 JP Equity -0.0152
## 10 NKY Consumer Discretionary 7269 JP Equity 0.0940
## # ℹ 17 more rows
Now, let’s have some more fun while we exercise our wrangling skills.
Here is the entire transcript of the series Friends:
With this, we now have access to the lines in the series:
## # A tibble: 67,373 × 6
## text speaker season episode scene utterance
## <chr> <chr> <int> <int> <int> <int>
## 1 There's nothing to tell! He's just so… Monica… 1 1 1 1
## 2 C'mon, you're going out with the guy!… Joey T… 1 1 1 2
## 3 All right Joey, be nice. So does he h… Chandl… 1 1 1 3
## 4 Wait, does he eat chalk? Phoebe… 1 1 1 4
## 5 (They all stare, bemused.) Scene … 1 1 1 5
## 6 Just, 'cause, I don't want her to go … Phoebe… 1 1 1 6
## 7 Okay, everybody relax. This is not ev… Monica… 1 1 1 7
## 8 Sounds like a date to me. Chandl… 1 1 1 8
## 9 [Time Lapse] Scene … 1 1 1 9
## 10 Alright, so I'm back in high school, … Chandl… 1 1 1 10
## # ℹ 67,363 more rows
Some more annotation detail - e.g. the emotion tagged for each utterance:
## # A tibble: 12,606 × 5
## season episode scene utterance emotion
## <int> <int> <int> <int> <chr>
## 1 1 1 4 1 Mad
## 2 1 1 4 3 Neutral
## 3 1 1 4 4 Joyful
## 4 1 1 4 5 Neutral
## 5 1 1 4 6 Neutral
## 6 1 1 4 7 Neutral
## 7 1 1 4 8 Scared
## 8 1 1 4 10 Joyful
## 9 1 1 4 11 Joyful
## 10 1 1 4 12 Sad
## # ℹ 12,596 more rows
As well as ratings information for the series:
## # A tibble: 236 × 8
## season episode title directed_by written_by air_date us_views_millions
## <int> <int> <chr> <chr> <chr> <date> <dbl>
## 1 1 1 The Pilot James Burr… David Cra… 1994-09-22 21.5
## 2 1 2 The One w… James Burr… David Cra… 1994-09-29 20.2
## 3 1 3 The One w… James Burr… Jeffrey A… 1994-10-06 19.5
## 4 1 4 The One w… James Burr… Alexa Jun… 1994-10-13 19.7
## 5 1 5 The One w… Pamela Fry… Jeff Gree… 1994-10-20 18.6
## 6 1 6 The One w… Arlene San… Adam Chas… 1994-10-27 18.2
## 7 1 7 The One w… James Burr… Jeffrey A… 1994-11-03 23.5
## 8 1 8 The One W… James Burr… Marta Kau… 1994-11-10 21.1
## 9 1 9 The One W… James Burr… Jeff Gree… 1994-11-17 23.1
## 10 1 10 The One w… Peter Bone… Adam Chas… 1994-12-15 19.9
## # ℹ 226 more rows
## # ℹ 1 more variable: imdb_rating <dbl>
Play around with this, and learn more about the series Friends while practicing your newfound skill set.
In the next session, we will expand the tidyverse into tidy plotting using ggplot.
To give you a taste for this, see the plot for the viewership and ratings over time:
friends_info %>%
group_by(season) %>% summarise(mean_rat = mean(imdb_rating)) %>%
ggplot() + geom_line(aes(season, mean_rat), color = "steelblue", size = 1.2, alpha = 0.8) +
scale_x_continuous(breaks = c(1:length(unique(friends_info$season)))) +
theme_bw() +
labs(x = "Season", y = "Avg Rating", caption = "Calculated using the Friends package in R",
title = "Friends Avg. Ratings Per Season")
# IMDB ratings per episode:
df_plot <-
friends_info %>%
mutate(Text = ifelse(imdb_rating == min(imdb_rating, na.rm=T), glue::glue("Lowest: S{season}, ep{episode} ({imdb_rating})\n{title}"),
ifelse(imdb_rating == max(imdb_rating, na.rm=T), glue::glue("Highest: S{season}, ep{episode} ({imdb_rating})\n{title}"),
NA_character_))) %>% mutate(season = as.character(season))
# For ordering (otherwise it goes 1;10;2;3;...):
df_plot$season <- factor(df_plot$season, levels = as.character(1:10))
df_plot %>%
ggplot() +
geom_boxplot(aes(x = season, y = imdb_rating, fill = season), alpha = 0.4) +
geom_jitter(aes(season, imdb_rating, color = season), size = 3, alpha = 0.8) +
scale_y_continuous(limits = c(7, 10), breaks = scales::breaks_pretty(n = 6)) +
ggrepel::geom_text_repel(aes(x = season, y = imdb_rating, label = Text),
force = T ) +
theme_bw() +
labs(x = "Season", y = "IMDP Ratings", caption = "Calculated using the Friends package in R",
title = "Friends Ratings Per Episode") +
scale_fill_hue(l=40, c=35) +
scale_color_hue(l=40, c=35) +
guides(fill = F, color = F)
# Viewership:
df_plot <-
friends_info %>%
mutate(Text = ifelse(us_views_millions == min(us_views_millions, na.rm=T), glue::glue("Worst: S{season}, ep{episode} ({us_views_millions}mn)\n{title}"),
ifelse(us_views_millions == max(us_views_millions, na.rm=T), glue::glue("Best: S{season}, ep{episode} ({us_views_millions}mn)\n{title}"),
NA_character_))) %>% mutate(season = as.character(season))
# For ordering (otherwise it goes 1;10;2;3;...):
df_plot$season <- factor(df_plot$season, levels = as.character(1:10))
df_plot %>%
ggplot() +
geom_boxplot(aes(x = season, y = us_views_millions, fill = season), alpha = 0.4) +
geom_jitter(aes(season, us_views_millions, color = season), size = 3, alpha = 0.8) +
scale_y_continuous(limits = c(15, 60), breaks = scales::breaks_pretty(n = 10)) +
ggrepel::geom_text_repel(aes(x = season, y = us_views_millions, label = Text),
force = T ) +
theme_bw() +
labs(x = "Season", y = "Views (mn)", caption = "Calculated using the Friends package in R",
title = "Friends US Views Per Episode") +
scale_fill_hue(l=40, c=35) +
scale_color_hue(l=40, c=35) +
guides(fill = F, color = F)
Typical American viewership - the one with the Super Bowl…
Hope you enjoyed this session.