Introduction

The aim of this tutorial is to introduce you to the tidyverse.

There are quite a few references offering more background, and I recommend consulting them for further reading.

What is the tidyverse?

It is effectively a reimagining of data analysis, built on requiring your data to be in a tidy format.

Once your data is tidy, analyzing and working with it is simpler, more stable and more efficient - and you will never need to type out clumsy for-loops again (at least in the traditional sense of the word).

Tidy data means each variable is in its own column, with each observation being in its own row.

We’ll get to what this means in examples later on.

  • (You can read more on the motivation behind tidy data in Hadley Wickham’s “Tidy Data” paper.)

Let’s first discuss the tidyverse:

tidyverse

The tidyverse package is a collection of various other packages, including:

  • dplyr : awesome for data wrangling

  • readr : awesome for importing data

  • ggplot2 : the standard for plotting in a tidy framework

  • tidyr : essential for manipulating dataframes

  • purrr : makes your code (and you) purrr with delight… you’ll see. Absolute syntactic sugar. Pure elation.

Piping

A key tool in using the tidyverse is the pipe operator ‘%>%’ (from the magrittr package).

  • x %>% f is equivalent to f(x)

  • x %>% f(y) is equivalent to f(x,y)

  • We can use a dot to refer to the data input: x %>% f(.) is equivalent to f(x) (see the sketch below)
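To make these equivalences concrete, here is a minimal sketch using base R functions (assuming the tidyverse, and with it magrittr’s pipe, is loaded):

# Each pair below is equivalent:
sqrt(16)
16 %>% sqrt()

round(pi, 2)
pi %>% round(2)

# The dot stands in for the piped-in value:
16 %>% sqrt(.)
# (R >= 4.1 also ships a native pipe, |>, which behaves similarly for simple calls)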

Type Ctrl+Shift+M in RStudio, and you will see the pipe appear. This will be your friend from now on.

‘%>%’ should be thought of and ‘read’ as ‘then’.

Thus, X %>% f(…) %>% g(…) should be read as: take X, then apply f(…), then apply g(…).

This makes your code much more readable (going from left to right) and easier to manage.

Let’s quickly see it in action using the ‘select’ verb:

pacman::p_load(tidyverse)
data_use <- fmxdat::Indexes
data_use %>% select(Date, brz, chl)
## # A tibble: 790 × 3
##    Date         brz   chl
##    <date>     <dbl> <dbl>
##  1 2000-01-14 1420.   NA 
##  2 2000-01-21 1382.   NA 
##  3 2000-01-28 1344.   NA 
##  4 2000-02-04 1426. 1522.
##  5 2000-02-11 1416. 1544.
##  6 2000-02-18 1372. 1443.
##  7 2000-02-25 1385. 1380.
##  8 2000-03-03 1439. 1456.
##  9 2000-03-10 1424. 1421.
## 10 2000-03-17 1383. 1419.
## # … with 780 more rows

dplyr

The cornerstone of the tidyverse is to make your data, well, tidy.

This means:

  • It is time to reboot how you think about dataframes.

  • dplyr requires you to think in a tidy way about your data - in particular, that your data should be in a long format, as opposed to wide.

  • The thinking behind dplyr is to organize your data in such a way that it makes analysis simpler (much simpler).

  • What this means practically is that your dataframes become long as opposed to wide.

If we think about our earlier dataframe - making it tidy implies stacking the country columns (brz, chl, …) into a single Country column, with the index values in a matching value column.

This allows us to easily apply the same functions to each Country sequentially.

Examples later will illustrate the ease with which this can be done.

How do I make my data tidy?

  • The tidyr verb ‘gather’ is used to make your data tidy - it effectively ‘gathers’ the columns that belong together.

Naturally this requires us to decide what to call the key column being gathered (here: Country), what to call the gathered values (here: val), and which columns to gather together.

pacman::p_load(tidyverse)
data_use <- fmxdat::Indexes
Tidydat <- data_use %>% tidyr::gather(Country, val, -Date) %>% arrange(Date)

print(Tidydat)
## # A tibble: 16,590 × 3
##    Date       Country    val
##    <date>     <chr>    <dbl>
##  1 2000-01-14 brz     1420. 
##  2 2000-01-14 chl       NA  
##  3 2000-01-14 chn       36.7
##  4 2000-01-14 col      105. 
##  5 2000-01-14 cze       91.7
##  6 2000-01-14 egt      352. 
##  7 2000-01-14 grc     1215. 
##  8 2000-01-14 hun      369. 
##  9 2000-01-14 ind      175. 
## 10 2000-01-14 ino      280. 
## # … with 16,580 more rows
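Note that in recent tidyr versions, gather is superseded by pivot_longer. A sketch of the equivalent call (the same tibble is expected):

data_use %>% 
  tidyr::pivot_longer(-Date, names_to = "Country", values_to = "val") %>% 
  arrange(Date)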

Now why would this be a preferable format?

Consider now the ease with which we could, e.g., calculate the:

  • Returns for each Country
Tidydat %>% 
  # When calculating returns - ensure your data is ordered by date
  arrange(Date) %>% 
  group_by(Country) %>% mutate(Returns = val / lag(val) - 1) %>% 
  ungroup()
## # A tibble: 16,590 × 4
##    Date       Country    val Returns
##    <date>     <chr>    <dbl>   <dbl>
##  1 2000-01-14 brz     1420.       NA
##  2 2000-01-14 chl       NA        NA
##  3 2000-01-14 chn       36.7      NA
##  4 2000-01-14 col      105.       NA
##  5 2000-01-14 cze       91.7      NA
##  6 2000-01-14 egt      352.       NA
##  7 2000-01-14 grc     1215.       NA
##  8 2000-01-14 hun      369.       NA
##  9 2000-01-14 ind      175.       NA
## 10 2000-01-14 ino      280.       NA
## # … with 16,580 more rows

mutate

Notice my use of the verb ‘mutate’ above.

Mutate adds a column by applying the function given. The idea can be summarised as:

df %>% mutate( New_Column_Name = f( Existing_Column ) )

  • E.g., we could calculate the standardized returns per country as follows:
Tidydat %>% 
  # When calculating returns - ensure your data is ordered by date
  arrange(Date) %>% 
  group_by(Country) %>% 
  mutate(Returns = val / lag(val) - 1) %>% 
  mutate(Std_Returns = ( Returns - mean(Returns, na.rm=T) ) / sd(Returns, na.rm=T) ) %>% 
  ungroup()
## # A tibble: 16,590 × 5
##    Date       Country    val Returns Std_Returns
##    <date>     <chr>    <dbl>   <dbl>       <dbl>
##  1 2000-01-14 brz     1420.       NA          NA
##  2 2000-01-14 chl       NA        NA          NA
##  3 2000-01-14 chn       36.7      NA          NA
##  4 2000-01-14 col      105.       NA          NA
##  5 2000-01-14 cze       91.7      NA          NA
##  6 2000-01-14 egt      352.       NA          NA
##  7 2000-01-14 grc     1215.       NA          NA
##  8 2000-01-14 hun      369.       NA          NA
##  9 2000-01-14 ind      175.       NA          NA
## 10 2000-01-14 ino      280.       NA          NA
## # … with 16,580 more rows
  • Notice that I ungrouped once my grouped calculation was done. This is important so you don’t accidentally preserve groups downstream (see the sketch below).
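A quick sketch of why this matters - if the groups are preserved, a later ‘overall’ calculation silently stays per-Country:

# Still grouped by Country: Mean_Return is the per-Country mean
Tidydat %>% 
  group_by(Country) %>% 
  mutate(Returns = val / lag(val) - 1) %>% 
  mutate(Mean_Return = mean(Returns, na.rm = TRUE))

# After ungroup(), the same mutate gives the overall mean across all countries
Tidydat %>% 
  group_by(Country) %>% 
  mutate(Returns = val / lag(val) - 1) %>% 
  ungroup() %>% 
  mutate(Mean_Return = mean(Returns, na.rm = TRUE))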

filter

‘filter’ is another useful verb. It allows easy subsetting of your data. E.g., let’s filter out the first date of our dataframe above (which has NA returns):

Tidydat %>% 
  # When calculating returns - ensure your data is ordered by date
  arrange(Date) %>% 
  group_by(Country) %>% 
  mutate(Returns = val / lag(val) - 1) %>% 
  filter(Date > first(Date)) %>% 
  ungroup()
## # A tibble: 16,569 × 4
##    Date       Country    val Returns
##    <date>     <chr>    <dbl>   <dbl>
##  1 2000-01-21 brz     1382.  -0.0267
##  2 2000-01-21 chl       NA   NA     
##  3 2000-01-21 chn       36.1 -0.0172
##  4 2000-01-21 col      102.  -0.0281
##  5 2000-01-21 cze       98.3  0.0722
##  6 2000-01-21 egt      375.   0.0656
##  7 2000-01-21 grc     1177.  -0.0320
##  8 2000-01-21 hun      379.   0.0271
##  9 2000-01-21 ind      179.   0.0232
## 10 2000-01-21 ino      261.  -0.0670
## # … with 16,559 more rows

Filter can also be used nicely with vector arguments…

Select_Countries <- c("brz", "zar")
Tidydat %>% filter(Country %in% Select_Countries) 
## # A tibble: 1,580 × 3
##    Date       Country   val
##    <date>     <chr>   <dbl>
##  1 2000-01-14 brz     1420.
##  2 2000-01-14 zar      257.
##  3 2000-01-21 brz     1382.
##  4 2000-01-21 zar      254.
##  5 2000-01-28 brz     1344.
##  6 2000-01-28 zar      229.
##  7 2000-02-04 brz     1426.
##  8 2000-02-04 zar      238.
##  9 2000-02-11 brz     1416.
## 10 2000-02-11 zar      239.
## # … with 1,570 more rows

…grepl (e.g. countries with a z or a b in their names)…

Tidydat %>% filter( grepl("z|b", Country))
## # A tibble: 2,370 × 3
##    Date       Country    val
##    <date>     <chr>    <dbl>
##  1 2000-01-14 brz     1420. 
##  2 2000-01-14 cze       91.7
##  3 2000-01-14 zar      257. 
##  4 2000-01-21 brz     1382. 
##  5 2000-01-21 cze       98.3
##  6 2000-01-21 zar      254. 
##  7 2000-01-28 brz     1344. 
##  8 2000-01-28 cze       97.1
##  9 2000-01-28 zar      229. 
## 10 2000-02-04 brz     1426. 
## # … with 2,360 more rows

… dates (notice the lubridate package)

pacman::p_load(lubridate)
# ....all dates after 2001 Jan 5:
Tidydat %>% filter( Date >= lubridate::ymd(20010105))
## # A tibble: 15,519 × 3
##    Date       Country    val
##    <date>     <chr>    <dbl>
##  1 2001-01-05 brz     1286. 
##  2 2001-01-05 chl     1202. 
##  3 2001-01-05 chn       26.8
##  4 2001-01-05 col       55.8
##  5 2001-01-05 cze       91.0
##  6 2001-01-05 egt      187. 
##  7 2001-01-05 grc      700. 
##  8 2001-01-05 hun      257. 
##  9 2001-01-05 ind      136. 
## 10 2001-01-05 ino      111. 
## # … with 15,509 more rows
# ....filter last two months only:
Tidydat %>% filter( Date >= last(Date) %m-% months(2))
## # A tibble: 189 × 3
##    Date       Country   val
##    <date>     <chr>   <dbl>
##  1 2015-01-02 brz     4980.
##  2 2015-01-02 chl     4381.
##  3 2015-01-02 chn      115.
##  4 2015-01-02 col     1745.
##  5 2015-01-02 cze      749.
##  6 2015-01-02 egt     1921.
##  7 2015-01-02 grc      171.
##  8 2015-01-02 hun      478.
##  9 2015-01-02 ind      707.
## 10 2015-01-02 ino     1727.
## # … with 179 more rows
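The %m-% used above is a lubridate operator that subtracts months without producing invalid dates - a quick sketch:

# Plain month arithmetic can fail on month-ends; %m-% rolls back to the last valid day:
lubridate::ymd(20200331) - months(1)     # NA (2020-02-31 does not exist)
lubridate::ymd(20200331) %m-% months(1)  # "2020-02-29"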

… values

Tidydat %>% filter( val > 1000)
## # A tibble: 5,512 × 3
##    Date       Country   val
##    <date>     <chr>   <dbl>
##  1 2000-01-14 brz     1420.
##  2 2000-01-14 grc     1215.
##  3 2000-01-14 mex     2377.
##  4 2000-01-14 tur     1048.
##  5 2000-01-21 brz     1382.
##  6 2000-01-21 grc     1177.
##  7 2000-01-21 mex     2216.
##  8 2000-01-28 brz     1344.
##  9 2000-01-28 grc     1125.
## 10 2000-01-28 mex     2106.
## # … with 5,502 more rows

How. Easy. Is. That.


summarise

Similar to mutate - but with the added feature of collapsing your data to one row per group, according to the function provided.

Best illustrated with an example:

Tidydat %>% group_by(Country) %>% mutate(Max = max(val, na.rm=T))
## # A tibble: 16,590 × 4
## # Groups:   Country [21]
##    Date       Country    val    Max
##    <date>     <chr>    <dbl>  <dbl>
##  1 2000-01-14 brz     1420.  10340.
##  2 2000-01-14 chl       NA    7483.
##  3 2000-01-14 chn       36.7   140.
##  4 2000-01-14 col      105.   2881.
##  5 2000-01-14 cze       91.7  1346.
##  6 2000-01-14 egt      352.   2675.
##  7 2000-01-14 grc     1215.   1975.
##  8 2000-01-14 hun      369.   1595.
##  9 2000-01-14 ind      175.    885.
## 10 2000-01-14 ino      280.   2072.
## # … with 16,580 more rows
Tidydat %>% group_by(Country) %>% summarise(Max = max(val, na.rm=T))
## # A tibble: 21 × 2
##    Country    Max
##    <chr>    <dbl>
##  1 brz     10340.
##  2 chl      7483.
##  3 chn       140.
##  4 col      2881.
##  5 cze      1346.
##  6 egt      2675.
##  7 grc      1975.
##  8 hun      1595.
##  9 ind       885.
## 10 ino      2072.
## # … with 11 more rows

Notice that in the second chunk, summarise returns only one max value per group, while mutate adds a column, repeating the group max at each date.
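summarise can also compute several statistics at once - a quick sketch:

Tidydat %>% 
  group_by(Country) %>% 
  summarise(Max = max(val, na.rm = TRUE), 
            Min = min(val, na.rm = TRUE), 
            N_Obs = n())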

join

Joining dataframes is key to good analytics.

Tidy joining is typically done using left_join or full_join by some common column.

Let’s work out an example to show this.

Suppose your data was stored as two Excel files with the following formats:

print( fmxdat::Indexes ) ; print(fmxdat::Indexes_Labs )
## # A tibble: 790 × 22
##    Date         brz   chl   chn   col   cze   egt   grc   hun   ind   ino   kor
##    <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2000-01-14 1420.   NA   36.7 105.   91.7  352. 1215.  369.  175.  280.  192.
##  2 2000-01-21 1382.   NA   36.1 102.   98.3  375. 1177.  379.  179.  261.  186.
##  3 2000-01-28 1344.   NA   36.0 104.   97.1  354. 1125.  379.  187.  246.  191.
##  4 2000-02-04 1426. 1522.  35.5 105.  104.   349. 1202.  370.  198.  239.  189.
##  5 2000-02-11 1416. 1544.  31.8 104.  114.   359. 1257.  396.  220.  249.  188.
##  6 2000-02-18 1372. 1443.  33.2  97.7 114.   350. 1209.  402.  243.  231.  172.
##  7 2000-02-25 1385. 1380.  29.7  90.3 112.   322. 1195.  405.  222.  219.  168.
##  8 2000-03-03 1439. 1456.  28.5  88.2 117.   327. 1164.  401.  217.  208.  177.
##  9 2000-03-10 1424. 1421.  29.8  88.4 117.   327. 1111.  425.  211.  234.  174.
## 10 2000-03-17 1383. 1419.  29.9  87.9 118.   328. 1099.  399.  189.  230.  169.
## # … with 780 more rows, and 10 more variables: mal <dbl>, mex <dbl>, per <dbl>,
## #   phi <dbl>, pol <dbl>, rus <dbl>, zar <dbl>, twn <dbl>, tha <dbl>, tur <dbl>
## # A tibble: 9 × 2
##   Country Label
##   <chr>   <chr>
## 1 brz     Brics
## 2 chn     Brics
## 3 zar     Brics
## 4 ind     Brics
## 5 rus     Brics
## 6 mex     Mint 
## 7 tur     Mint 
## 8 hun     HEGP 
## 9 egt     HEGP

Notice that combining this information (e.g. to do calculations by BRICS group) would otherwise have been onerous.

Now it is simple using ‘left_join’ - joining the Label from the second dataframe onto the first:

Joined_Set <- 
left_join(
  fmxdat::Indexes %>% gather(Country, Return, -Date),
  fmxdat::Indexes_Labs,
  by = "Country"
) %>% arrange(Date)

Notice that joining allocates Label by Country, filling unmatched countries with NA (e.g. chl, col and cze had no label, and thus get NA labels).

In the above example full_join and left_join produce equivalent results. In general, though, left_join keeps only the rows of the left dataframe (discarding, e.g., unmatched dates in the right dataframe), while full_join preserves all rows from both sides. So if the ‘right’ dataframe (Indexes_Labs) had contained other information (like dates) too, full_join would have ensured all of it is preserved.
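A minimal sketch of the difference, using hypothetical toy tibbles:

left  <- tibble(Country = c("brz", "chl"), val = c(1, 2))
right <- tibble(Country = c("brz", "zar"), Label = c("Brics", "Brics"))

left_join(left, right, by = "Country")  # rows: brz, chl (zar dropped; chl gets an NA Label)
full_join(left, right, by = "Country")  # rows: brz, chl, zar (zar gets an NA val)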

Creating a sentence with verbs

Let’s now use these verbs on a wide dataframe to do some nice things.

First, our dataset is wide (which should start bothering you):

fmxdat::Wide_Example
## # A tibble: 12 × 15
##    date       SubFactors AAA S…¹ BBB S…² CCC S…³ DDD S…⁴ EEE S…⁵ FFF S…⁶ TTT S…⁷
##    <date>     <chr>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 2002-01-31 EBIT.EV      15.8     7.78   14.8    13.6    12.5    10.4   -1.23 
##  2 2002-01-31 FCF.EV       10.5     6.78   12.3    16.0    13.5    12.0   -1.47 
##  3 2002-01-31 ROA          13.4    15.7    12.6     9.71   10.5     8.55  -1.42 
##  4 2002-01-31 ROE           3.22   15.9    11.9    10.9    11.9    14.3    0.499
##  5 2002-01-31 TRR6M1M      13.4    14.4     9.76   24.6    17.4    18.4    0.482
##  6 2002-01-31 Volat.D.60   11.4     9.82   12.5    13.0    15.6    21.3   -0.138
##  7 2002-02-28 EBIT.EV       5.83   10.7    10.3    17.2     5.04    7.92   1.07 
##  8 2002-02-28 FCF.EV       11.6     9.17   10.5     5.60   13.9     5.52   0.185
##  9 2002-02-28 ROA           4.32   11.5    10.0    13.0    10.2     9.52  -0.930
## 10 2002-02-28 ROE          15.0    13.6    13.0    11.8    20.0     1.98   0.633
## 11 2002-02-28 TRR6M1M      13.7     9.32   10.6    11.3    14.5    15.1    0.517
## 12 2002-02-28 Volat.D.60    6.89    7.87   13.1    10.6    12.2     8.28  -1.21 
## # … with 6 more variables: `UUU SPGLOB Equity` <dbl>,
## #   `VVV SPGLOB Equity` <dbl>, `XXX SPGLOB Equity` <dbl>,
## #   `YYY SPGLOB Equity` <dbl>, `ZZZ SPGLOB Equity` <dbl>, AnotherColumn <chr>,
## #   and abbreviated variable names ¹​`AAA SJ Equity`, ²​`BBB SJ Equity`,
## #   ³​`CCC SJ Equity`, ⁴​`DDD SJ Equity`, ⁵​`EEE SJ Equity`, ⁶​`FFF SJ Equity`,
## #   ⁷​`TTT SPGLOB Equity`

Before we gather - let’s plan.

First, our columns suggest two types of stocks - some are from SA (ending with ’ SJ Equity’), while others are US (ending with ’ SPGLOB Equity’).

Second, the values correspond to the SubFactors column.

Let’s gather these by Stocks generally first, after which we create a column where the countries are explicitly labelled.

To do so, we need to exploit the pattern in the column names. Here we can use the tidyselect helpers contains, starts_with, ends_with, one_of, etc.

Let’s use ‘contains’ with ‘ Equity’ - which will suffice, as it gathers all the stock columns (and ignores ‘date’, ‘AnotherColumn’ and ‘SubFactors’):

df_tidy <- fmxdat::Wide_Example %>% gather(Stocks, Value, contains(" Equity"))
print(df_tidy)
## # A tibble: 144 × 5
##    date       SubFactors AnotherColumn Stocks        Value
##    <date>     <chr>      <chr>         <chr>         <dbl>
##  1 2002-01-31 EBIT.EV    Random        AAA SJ Equity 15.8 
##  2 2002-01-31 FCF.EV     Random        AAA SJ Equity 10.5 
##  3 2002-01-31 ROA        Random        AAA SJ Equity 13.4 
##  4 2002-01-31 ROE        Random        AAA SJ Equity  3.22
##  5 2002-01-31 TRR6M1M    Random        AAA SJ Equity 13.4 
##  6 2002-01-31 Volat.D.60 Random        AAA SJ Equity 11.4 
##  7 2002-02-28 EBIT.EV    Random        AAA SJ Equity  5.83
##  8 2002-02-28 FCF.EV     Random        AAA SJ Equity 11.6 
##  9 2002-02-28 ROA        Random        AAA SJ Equity  4.32
## 10 2002-02-28 ROE        Random        AAA SJ Equity 15.0 
## # … with 134 more rows
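The same gather could also be written with ends_with - a sketch (either helper works here, as all the stock columns end with ‘Equity’):

fmxdat::Wide_Example %>% gather(Stocks, Value, ends_with("Equity"))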

Awesome. Let’s now create a column labelling the country based on SJ and SPGLOB using grepl, after which we gsub out the ’ SJ Equity’ and ’ SPGLOB Equity’ parts (see how nicely these play along with mutate):

df_tidy_Cleaned <- 
  df_tidy %>% 
  mutate(Country = ifelse(  grepl(" SJ Equity", Stocks), "SA", 
                      ifelse(  grepl(" SPGLOB Equity", Stocks), "SPGLOB",  NA)) ) %>% 
  
  mutate(Stocks = gsub(" SJ Equity| SPGLOB Equity", "", Stocks))

print(df_tidy_Cleaned)
## # A tibble: 144 × 6
##    date       SubFactors AnotherColumn Stocks Value Country
##    <date>     <chr>      <chr>         <chr>  <dbl> <chr>  
##  1 2002-01-31 EBIT.EV    Random        AAA    15.8  SA     
##  2 2002-01-31 FCF.EV     Random        AAA    10.5  SA     
##  3 2002-01-31 ROA        Random        AAA    13.4  SA     
##  4 2002-01-31 ROE        Random        AAA     3.22 SA     
##  5 2002-01-31 TRR6M1M    Random        AAA    13.4  SA     
##  6 2002-01-31 Volat.D.60 Random        AAA    11.4  SA     
##  7 2002-02-28 EBIT.EV    Random        AAA     5.83 SA     
##  8 2002-02-28 FCF.EV     Random        AAA    11.6  SA     
##  9 2002-02-28 ROA        Random        AAA     4.32 SA     
## 10 2002-02-28 ROE        Random        AAA    15.0  SA     
## # … with 134 more rows

This takes a bit of time getting used to - so practice this a bit.

Spread

If you want to go back to a wide format, spread is the opposite of gather.

E.g. let’s spread all the ROA and ROE values for SA stocks (notice my use of %in% below….):

df_tidy_Cleaned %>% 
  filter(SubFactors %in% c("ROA", "ROE"), Country == "SA") %>% 
  spread(SubFactors, Value)
## # A tibble: 12 × 6
##    date       AnotherColumn Stocks Country   ROA   ROE
##    <date>     <chr>         <chr>  <chr>   <dbl> <dbl>
##  1 2002-01-31 Random        AAA    SA      13.4   3.22
##  2 2002-01-31 Random        BBB    SA      15.7  15.9 
##  3 2002-01-31 Random        CCC    SA      12.6  11.9 
##  4 2002-01-31 Random        DDD    SA       9.71 10.9 
##  5 2002-01-31 Random        EEE    SA      10.5  11.9 
##  6 2002-01-31 Random        FFF    SA       8.55 14.3 
##  7 2002-02-28 Random        AAA    SA       4.32 15.0 
##  8 2002-02-28 Random        BBB    SA      11.5  13.6 
##  9 2002-02-28 Random        CCC    SA      10.0  13.0 
## 10 2002-02-28 Random        DDD    SA      13.0  11.8 
## 11 2002-02-28 Random        EEE    SA      10.2  20.0 
## 12 2002-02-28 Random        FFF    SA       9.52  1.98
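As with gather, spread is superseded in recent tidyr versions by pivot_wider. A sketch of the equivalent call (the same result is expected):

df_tidy_Cleaned %>% 
  filter(SubFactors %in% c("ROA", "ROE"), Country == "SA") %>% 
  pivot_wider(names_from = SubFactors, values_from = Value)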

Other helpful verbs:

There are many, many combinations of these core verbs that allow efficient data analytics - sometimes with the help of some auxiliary verbs.

E.g., let’s use the dplyr-native dataframe ‘dplyr::storms’ to do some nice wrangling from which you can learn a few tricks… (I will note the tricks in brackets before using them.)

# The df looks like:
print(dplyr::storms)
## # A tibble: 19,066 × 13
##    name   year month   day  hour   lat  long status        categ…¹  wind press…²
##    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>           <dbl> <int>   <int>
##  1 Amy    1975     6    27     0  27.5 -79   tropical dep…      NA    25    1013
##  2 Amy    1975     6    27     6  28.5 -79   tropical dep…      NA    25    1013
##  3 Amy    1975     6    27    12  29.5 -79   tropical dep…      NA    25    1013
##  4 Amy    1975     6    27    18  30.5 -79   tropical dep…      NA    25    1013
##  5 Amy    1975     6    28     0  31.5 -78.8 tropical dep…      NA    25    1012
##  6 Amy    1975     6    28     6  32.4 -78.7 tropical dep…      NA    25    1012
##  7 Amy    1975     6    28    12  33.3 -78   tropical dep…      NA    25    1011
##  8 Amy    1975     6    28    18  34   -77   tropical dep…      NA    30    1006
##  9 Amy    1975     6    29     0  34.4 -75.8 tropical sto…      NA    35    1004
## 10 Amy    1975     6    29     6  34   -74.8 tropical sto…      NA    40    1002
## # … with 19,056 more rows, 2 more variables:
## #   tropicalstorm_force_diameter <int>, hurricane_force_diameter <int>, and
## #   abbreviated variable names ¹​category, ²​pressure

Let’s see which statuses are available (status, unique):

dplyr::storms %>% pull(status) %>% unique
## [1] tropical depression    tropical storm         extratropical         
## [4] hurricane              subtropical storm      subtropical depression
## [7] disturbance            other low              tropical wave         
## 9 Levels: disturbance extratropical hurricane ... tropical wave

Let’s replace NA values in hurricane_force_diameter with something elaborate: the ‘lat’ column’s minimum value times the ‘long’ column’s maximum value (coalesce):

dplyr::storms %>% mutate(hurricane_force_diameter = coalesce(hurricane_force_diameter, min(lat, na.rm=T) * max(long, na.rm=T)))
## # A tibble: 19,066 × 13
##    name   year month   day  hour   lat  long status        categ…¹  wind press…²
##    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>           <dbl> <int>   <int>
##  1 Amy    1975     6    27     0  27.5 -79   tropical dep…      NA    25    1013
##  2 Amy    1975     6    27     6  28.5 -79   tropical dep…      NA    25    1013
##  3 Amy    1975     6    27    12  29.5 -79   tropical dep…      NA    25    1013
##  4 Amy    1975     6    27    18  30.5 -79   tropical dep…      NA    25    1013
##  5 Amy    1975     6    28     0  31.5 -78.8 tropical dep…      NA    25    1012
##  6 Amy    1975     6    28     6  32.4 -78.7 tropical dep…      NA    25    1012
##  7 Amy    1975     6    28    12  33.3 -78   tropical dep…      NA    25    1011
##  8 Amy    1975     6    28    18  34   -77   tropical dep…      NA    30    1006
##  9 Amy    1975     6    29     0  34.4 -75.8 tropical sto…      NA    35    1004
## 10 Amy    1975     6    29     6  34   -74.8 tropical sto…      NA    40    1002
## # … with 19,056 more rows, 2 more variables:
## #   tropicalstorm_force_diameter <int>, hurricane_force_diameter <dbl>, and
## #   abbreviated variable names ¹​category, ²​pressure

Let’s apply the same function to several columns using across:

dplyr::storms %>% mutate( across( .cols = c(lat, long, pressure, tropicalstorm_force_diameter), 
                                  .fns = ~max(., na.rm=T)) )
## # A tibble: 19,066 × 13
##    name   year month   day  hour   lat  long status        categ…¹  wind press…²
##    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>           <dbl> <int>   <int>
##  1 Amy    1975     6    27     0  70.7  13.5 tropical dep…      NA    25    1024
##  2 Amy    1975     6    27     6  70.7  13.5 tropical dep…      NA    25    1024
##  3 Amy    1975     6    27    12  70.7  13.5 tropical dep…      NA    25    1024
##  4 Amy    1975     6    27    18  70.7  13.5 tropical dep…      NA    25    1024
##  5 Amy    1975     6    28     0  70.7  13.5 tropical dep…      NA    25    1024
##  6 Amy    1975     6    28     6  70.7  13.5 tropical dep…      NA    25    1024
##  7 Amy    1975     6    28    12  70.7  13.5 tropical dep…      NA    25    1024
##  8 Amy    1975     6    28    18  70.7  13.5 tropical dep…      NA    30    1024
##  9 Amy    1975     6    29     0  70.7  13.5 tropical sto…      NA    35    1024
## 10 Amy    1975     6    29     6  70.7  13.5 tropical sto…      NA    40    1024
## # … with 19,056 more rows, 2 more variables:
## #   tropicalstorm_force_diameter <int>, hurricane_force_diameter <int>, and
## #   abbreviated variable names ¹​category, ²​pressure
# Or using summarise and a naming pattern:
dplyr::storms %>% summarise( across( .cols = c(contains("diameter"), starts_with("press")), 
                                  .fns = ~max(., na.rm=T)) )
## # A tibble: 1 × 3
##   tropicalstorm_force_diameter hurricane_force_diameter pressure
##                          <int>                    <int>    <int>
## 1                         1440                      300     1024

Let’s apply several functions to several columns - naming each output column with a suffix:

dplyr::storms %>% mutate( across( .cols = c(lat, long, pressure, tropicalstorm_force_diameter), 
                                  .fns = list( Max = ~max(., na.rm=T), Min = ~min(., na.rm=T)) ))
## # A tibble: 19,066 × 21
##    name   year month   day  hour   lat  long status        categ…¹  wind press…²
##    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>           <dbl> <int>   <int>
##  1 Amy    1975     6    27     0  27.5 -79   tropical dep…      NA    25    1013
##  2 Amy    1975     6    27     6  28.5 -79   tropical dep…      NA    25    1013
##  3 Amy    1975     6    27    12  29.5 -79   tropical dep…      NA    25    1013
##  4 Amy    1975     6    27    18  30.5 -79   tropical dep…      NA    25    1013
##  5 Amy    1975     6    28     0  31.5 -78.8 tropical dep…      NA    25    1012
##  6 Amy    1975     6    28     6  32.4 -78.7 tropical dep…      NA    25    1012
##  7 Amy    1975     6    28    12  33.3 -78   tropical dep…      NA    25    1011
##  8 Amy    1975     6    28    18  34   -77   tropical dep…      NA    30    1006
##  9 Amy    1975     6    29     0  34.4 -75.8 tropical sto…      NA    35    1004
## 10 Amy    1975     6    29     6  34   -74.8 tropical sto…      NA    40    1002
## # … with 19,056 more rows, 10 more variables:
## #   tropicalstorm_force_diameter <int>, hurricane_force_diameter <int>,
## #   lat_Max <dbl>, lat_Min <dbl>, long_Max <dbl>, long_Min <dbl>,
## #   pressure_Max <int>, pressure_Min <int>,
## #   tropicalstorm_force_diameter_Max <int>,
## #   tropicalstorm_force_diameter_Min <int>, and abbreviated variable names
## #   ¹​category, ²​pressure

Bit more advanced….

Let’s create a column, appended to the original dataframe, that gives the maximum wind speed, storm name and its status, per year. Then calculate each observation’s wind speed as a percentage of the max speed reached for that year.

left_join(
  dplyr::storms,
  dplyr::storms %>% group_by(year) %>% filter(wind == max(wind)) %>% 
    select(year, name, wind, status ) %>% ungroup() %>% 
    mutate(MaxSpeed = wind, Max_Wind_Speed_Msg = paste0("max status reached: ", status)) %>% 
    select(-status, -wind, -name) %>% unique,
  by = c("year")
) %>% 
  mutate(Pct_of_Max = paste0( round( wind / MaxSpeed, 2) * 100, "%" ))
## # A tibble: 19,066 × 16
##    name   year month   day  hour   lat  long status        categ…¹  wind press…²
##    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>           <dbl> <int>   <int>
##  1 Amy    1975     6    27     0  27.5 -79   tropical dep…      NA    25    1013
##  2 Amy    1975     6    27     6  28.5 -79   tropical dep…      NA    25    1013
##  3 Amy    1975     6    27    12  29.5 -79   tropical dep…      NA    25    1013
##  4 Amy    1975     6    27    18  30.5 -79   tropical dep…      NA    25    1013
##  5 Amy    1975     6    28     0  31.5 -78.8 tropical dep…      NA    25    1012
##  6 Amy    1975     6    28     6  32.4 -78.7 tropical dep…      NA    25    1012
##  7 Amy    1975     6    28    12  33.3 -78   tropical dep…      NA    25    1011
##  8 Amy    1975     6    28    18  34   -77   tropical dep…      NA    30    1006
##  9 Amy    1975     6    29     0  34.4 -75.8 tropical sto…      NA    35    1004
## 10 Amy    1975     6    29     6  34   -74.8 tropical sto…      NA    40    1002
## # … with 19,056 more rows, 5 more variables:
## #   tropicalstorm_force_diameter <int>, hurricane_force_diameter <int>,
## #   MaxSpeed <int>, Max_Wind_Speed_Msg <chr>, Pct_of_Max <chr>, and abbreviated
## #   variable names ¹​category, ²​pressure

How powerful are these tools, right? We will see more examples of things we can do later.

stringr

Nice for adjusting strings:

  • str_sub(): Extracts substrings from a character vector

  • str_trim(): Trims white space

  • str_length(): Returns the length of the string

  • str_to_lower()/str_to_upper(): Converts the string to lower or upper case (see the sketch below)
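A quick sketch of these in action (stringr loads with the tidyverse):

str_sub("JALSHAll", 1, 5)  # "JALSH"
str_trim("  brz  ")        # "brz"
str_length("Bitcoin")      # 7
str_to_upper("brz")        # "BRZ"
str_to_lower("BRZ")        # "brz"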

Examples

Let’s try out some more fun data wrangling examples…. hope you are getting excited!

Cryptos!

Let’s calculate a few things, per month, for several crypto currencies.

  • Tip: we use format for dates here… format(date, "%b") gives e.g. the month as Feb. %B gives February, while %m gives 02.

    • Play around, it is quite flexible: format(today(), "%d %B, i.e. %dst day of %B in the year %Y") - see the sketch below.
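A minimal sketch of format in action (assuming lubridate is loaded for ymd and today):

pacman::p_load(lubridate)
format(ymd(20200202), "%b")  # "Feb" (in an English locale)
format(ymd(20200202), "%B")  # "February"
format(ymd(20200202), "%m")  # "02"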

Now let’s crypto…

  • First - calculate the following per month: the returns for each crypto, and the max spread between high and low (notice: n() gives the number of observations, or rows, by group):
Cryptos <- fmxdat::cryptos

Crypto_Adj <- 
Cryptos %>% 
  mutate(YearMonth = format(date, "%b%y")) %>% 
  group_by(YearMonth) %>% 
  mutate(LastDate = last(date)) %>% 
  group_by(LastDate, name, YearMonth) %>% 
  summarise(
    Monthly_Return = last(close) / first(close) - 1,
    hilo = max(high) - min(low), 
    N_Cryptos = n())  

print(Crypto_Adj)
## # A tibble: 1,631 × 6
## # Groups:   LastDate, name [1,631]
##    LastDate   name     YearMonth Monthly_Return   hilo N_Cryptos
##    <date>     <chr>    <chr>              <dbl>  <dbl>     <int>
##  1 2013-04-30 Bitcoin  Apr13             0.0357 15.4           3
##  2 2013-04-30 Litecoin Apr13            -0.0115  0.400         3
##  3 2013-05-31 Bitcoin  May13             0.103  60.8          31
##  4 2013-05-31 Litecoin May13            -0.221   1.96         31
##  5 2013-06-30 Bitcoin  Jun13            -0.253  41.3          30
##  6 2013-06-30 Litecoin Jun13             0.0530  1.56         30
##  7 2013-07-31 Bitcoin  Jul13             0.205  45.8          31
##  8 2013-07-31 Litecoin Jul13            -0.0271  0.97         31
##  9 2013-08-31 Bitcoin  Aug13             0.301  39.7          31
## 10 2013-08-31 Litecoin Aug13            -0.113   0.63         31
## # … with 1,621 more rows
  • Let’s now go further and calculate the number of times each coin was in the top three performers after 2014, and rank the coins (Notice: top_n):
Crypto_Adj %>% 
  group_by(YearMonth) %>% 
  # rank() ranks ascending by default - use desc() so that Rank == 1 is the top performer
  mutate(Rank = rank(desc(Monthly_Return)) ) %>% 
  group_by(name) %>% 
  summarise( Pct_Top_Three = sum( ifelse( LastDate >= lubridate::ymd(20140101) & 
                                            Rank %in% c(1, 2, 3), 1, 0 ) ) / n()) %>% 
  top_n(5, Pct_Top_Three) %>% 
  arrange(desc(Pct_Top_Three))
## # A tibble: 5 × 2
##   name         Pct_Top_Three
##   <chr>                <dbl>
## 1 Algorand             0.375
## 2 MINDOL               0.364
## 3 ThoreNext            0.333
## 4 Ravencoin            0.261
## 5 Bitcoin Gold         0.25
  • Look at that last line - note how much is being done in a single line of code.

Now, see if you can calculate the coin with the highest return since 2018 for coins that listed before March 2016. Go!

Finance

Provided with the following dataframe: fmxdat::Fin_Data_SA_US_NKY, let’s:

  • Calculate each stock’s market-cap weight within its universe, as well as within its universe-sector grouping (Note: this dataset contains only a few stocks and is NOT representative of the universes they trade in… it is for illustrative purposes only):
fmxdat::Fin_Data_SA_US_NKY %>% 
  group_by(date, Universe) %>% 
  mutate(Market_Weight = Market.Cap / sum(Market.Cap)) %>%  
  group_by(date, Universe, BICS_LEVEL_1_SECTOR_NAME) %>% 
  mutate(Index_Weight = Market.Cap / sum(Market.Cap)) %>% arrange(date)
## # A tibble: 35,704 × 10
## # Groups:   date, Universe, BICS_LEVEL_1_SECTOR_NAME [7,641]
##    date       Tickers        TRI Short…¹ Marke…² BICS_…³ BICS_…⁴ Unive…⁵ Marke…⁶
##    <date>     <chr>        <dbl> <chr>     <dbl> <chr>   <chr>   <chr>     <dbl>
##  1 2017-01-02 ACT SJ Equ…  1.17  AFROCE… 2.83e 8 Health… Health… JALSHA… 0.00132
##  2 2017-01-02 ANG SJ Equ… 16.4   ANGLOG… 5.15e 9 Materi… Metals… JALSHA… 0.0241 
##  3 2017-01-02 AWA SJ Equ…  0.967 ARROWH… 6.85e 8 Financ… Real E… JALSHA… 0.00321
##  4 2017-01-02 BID SJ Equ… 18.0   BID CO… 5.77e 9 Consum… Commer… JALSHA… 0.0270 
##  5 2017-01-02 BIL SJ Equ… 28.8   BHP BI… 1.03e11 Materi… Iron &… JALSHA… 0.481  
##  6 2017-01-02 CLH SJ Equ… 28.9   CITY L… 4.76e 8 Consum… Gaming… JALSHA… 0.00223
##  7 2017-01-02 CLS SJ Equ… 14.9   CLICKS… 2.22e 9 Consum… Retail… JALSHA… 0.0104 
##  8 2017-01-02 COH SJ Equ…  3.20  CURRO … 1.44e 9 Consum… Consum… JALSHA… 0.00674
##  9 2017-01-02 CRP SJ Equ…  0.690 CAP & … 5.06e 8 Financ… Real E… JALSHA… 0.00237
## 10 2017-01-02 EPP SJ Equ…  1.42  ECHO P… 8.48e 8 Financ… Real E… JALSHA… 0.00397
## # … with 35,694 more rows, 1 more variable: Index_Weight <dbl>, and abbreviated
## #   variable names ¹​Short.Name, ²​Market.Cap, ³​BICS_LEVEL_1_SECTOR_NAME,
## #   ⁴​BICS_LEVEL_2_INDUSTRY_GROUP_NAME, ⁵​Universe, ⁶​Market_Weight
  • Calculate the best performing stock, per universe and sector, since July 2017 - in terms of Sharpe ratios (Notice my use of sequential group_bys to achieve my intended outcome. Planning your group_by operations is essential to achieving what you set out to do):
fmxdat::Fin_Data_SA_US_NKY %>% 
  arrange(date) %>% 
  group_by(Tickers) %>% 
  mutate(Return = TRI / lag(TRI) - 1) %>% filter(date > first(date)) %>% 
  filter(date >= lubridate::ymd(20170701)) %>%
  group_by(Universe, BICS_LEVEL_1_SECTOR_NAME, Tickers) %>% 
  summarise(Sharpe = mean(Return, na.rm = T) / sd(Return, na.rm = T)) %>% 
  group_by(Universe, BICS_LEVEL_1_SECTOR_NAME) %>% 
  filter(Sharpe == max(Sharpe, na.rm=T)) %>% ungroup()
## # A tibble: 27 × 4
##    Universe BICS_LEVEL_1_SECTOR_NAME Tickers         Sharpe
##    <chr>    <chr>                    <chr>            <dbl>
##  1 JALSHAll Communications           VOD SJ Equity   0.0483
##  2 JALSHAll Consumer Discretionary   CLH SJ Equity   0.0590
##  3 JALSHAll Consumer Staples         CLS SJ Equity   0.110 
##  4 JALSHAll Energy                   EXX SJ Equity   0.162 
##  5 JALSHAll Financials               TTO SJ Equity   0.116 
##  6 JALSHAll Health Care              ACT SJ Equity   0.0254
##  7 JALSHAll Industrials              GRF SJ Equity  -0.0566
##  8 JALSHAll Materials                KIO SJ Equity   0.221 
##  9 NKY      Communications           9433 JP Equity -0.0152
## 10 NKY      Consumer Discretionary   7269 JP Equity  0.0940
## # … with 17 more rows

Friends

Now, let’s have some more fun while we exercise our wrangling skills.

Here is the entire transcript of the series Friends:

library(tidyverse)
pacman::p_load(friends, scales)

With this, we now have access to the lines in the series:

print(friends)
## # A tibble: 67,373 × 6
##    text                                     speaker season episode scene utter…¹
##    <chr>                                    <chr>    <int>   <int> <int>   <int>
##  1 There's nothing to tell! He's just some… Monica…      1       1     1       1
##  2 C'mon, you're going out with the guy! T… Joey T…      1       1     1       2
##  3 All right Joey, be nice. So does he hav… Chandl…      1       1     1       3
##  4 Wait, does he eat chalk?                 Phoebe…      1       1     1       4
##  5 (They all stare, bemused.)               Scene …      1       1     1       5
##  6 Just, 'cause, I don't want her to go th… Phoebe…      1       1     1       6
##  7 Okay, everybody relax. This is not even… Monica…      1       1     1       7
##  8 Sounds like a date to me.                Chandl…      1       1     1       8
##  9 [Time Lapse]                             Scene …      1       1     1       9
## 10 Alright, so I'm back in high school, I'… Chandl…      1       1     1      10
## # … with 67,363 more rows, and abbreviated variable name ¹​utterance

There are also some more annotations - e.g. the emotion attached to each utterance:

print(friends_emotions)
## # A tibble: 12,606 × 5
##    season episode scene utterance emotion
##     <int>   <int> <int>     <int> <chr>  
##  1      1       1     4         1 Mad    
##  2      1       1     4         3 Neutral
##  3      1       1     4         4 Joyful 
##  4      1       1     4         5 Neutral
##  5      1       1     4         6 Neutral
##  6      1       1     4         7 Neutral
##  7      1       1     4         8 Scared 
##  8      1       1     4        10 Joyful 
##  9      1       1     4        11 Joyful 
## 10      1       1     4        12 Sad    
## # … with 12,596 more rows

As well as ratings information for the series:

print(friends_info)
## # A tibble: 236 × 8
##    season episode title               direc…¹ writt…² air_date   us_vi…³ imdb_…⁴
##     <int>   <int> <chr>               <chr>   <chr>   <date>       <dbl>   <dbl>
##  1      1       1 The Pilot           James … David … 1994-09-22    21.5     8.3
##  2      1       2 The One with the S… James … David … 1994-09-29    20.2     8.1
##  3      1       3 The One with the T… James … Jeffre… 1994-10-06    19.5     8.2
##  4      1       4 The One with Georg… James … Alexa … 1994-10-13    19.7     8.1
##  5      1       5 The One with the E… Pamela… Jeff G… 1994-10-20    18.6     8.5
##  6      1       6 The One with the B… Arlene… Adam C… 1994-10-27    18.2     8.1
##  7      1       7 The One with the B… James … Jeffre… 1994-11-03    23.5     9  
##  8      1       8 The One Where Nana… James … Marta … 1994-11-10    21.1     8.1
##  9      1       9 The One Where Unde… James … Jeff G… 1994-11-17    23.1     8.2
## 10      1      10 The One with the M… Peter … Adam C… 1994-12-15    19.9     8.1
## # … with 226 more rows, and abbreviated variable names ¹​directed_by,
## #   ²​written_by, ³​us_views_millions, ⁴​imdb_rating

Play around with this, and learn more about the series Friends while practicing your newfound skill set.

In the next session, we will expand the tidyverse into tidy plotting using ggplot.

To give you a taste of this, see the plots of the ratings and viewership over time:

friends_info %>% 
  group_by(season) %>% summarise(mean_rat = mean(imdb_rating)) %>% 
  ggplot() + geom_line(aes(season, mean_rat), color = "steelblue", size = 1.2, alpha = 0.8) + 
  scale_x_continuous(breaks = c(1:length(unique(friends_info$season)))) + 
  theme_bw() + 
  labs(x = "Season", y = "Avg Rating", caption = "Calculated using the Friends package in R",
       title = "Friends Avg. Ratings Per Season")

# IMDB ratings per episode:

df_plot <- 
friends_info %>% 
  mutate(Text = ifelse(imdb_rating == min(imdb_rating, na.rm=T), glue::glue("Lowest: S{season}, ep{episode} ({imdb_rating})\n{title}"), 
                        ifelse(imdb_rating == max(imdb_rating, na.rm=T), glue::glue("Highest: S{season}, ep{episode} ({imdb_rating})\n{title}"),
                               NA_character_))) %>% mutate(season = as.character(season))

# For ordering (otherwise it goes 1;10;2;3;...):
df_plot$season <- factor(df_plot$season, levels = as.character(1:10))

df_plot %>% 
  ggplot() + 
  geom_boxplot(aes(x = season, y = imdb_rating, fill = season), alpha = 0.4) + 
  geom_jitter(aes(season, imdb_rating, color = season), size = 3, alpha = 0.8) + 
  scale_y_continuous(limits = c(7, 10), breaks = scales::breaks_pretty(n = 6)) + 
  ggrepel::geom_text_repel(aes(x = season, y = imdb_rating, label = Text), 
                           force = T ) + 
  theme_bw() + 
  labs(x = "Season", y = "IMDB Ratings", caption = "Calculated using the Friends package in R",
       title = "Friends Ratings Per Episode") + 
  scale_fill_hue(l=40, c=35) + 
  scale_color_hue(l=40, c=35) + 
  guides(fill = "none", color = "none")

# Viewership:

df_plot <- 
friends_info %>% 
  mutate(Text = ifelse(us_views_millions == min(us_views_millions, na.rm=T), glue::glue("Worst: S{season}, ep{episode} ({us_views_millions}mn)\n{title}"), 
                        ifelse(us_views_millions == max(us_views_millions, na.rm=T), glue::glue("Best: S{season}, ep{episode} ({us_views_millions}mn)\n{title}"),
                               NA_character_))) %>% mutate(season = as.character(season))

# For ordering (otherwise it goes 1;10;2;3;...):
df_plot$season <- factor(df_plot$season, levels = as.character(1:10))

df_plot %>% 
  ggplot() + 
  geom_boxplot(aes(x = season, y = us_views_millions, fill = season), alpha = 0.4) + 
  geom_jitter(aes(season, us_views_millions, color = season), size = 3, alpha = 0.8) + 
  scale_y_continuous(limits = c(15, 60), breaks = scales::breaks_pretty(n = 10)) + 
  ggrepel::geom_text_repel(aes(x = season, y = us_views_millions, label = Text), 
                           force = T ) + 
  theme_bw() + 
    labs(x = "Season", y = "Views (mn)", caption = "Calculated using the Friends package in R",
       title = "Friends US Views Per Episode") + 
  scale_fill_hue(l=40, c=35) + 
  scale_color_hue(l=40, c=35) + 
  guides(fill = "none", color = "none")

Typical American viewership - the one after the Super Bowl…


Hope you enjoyed this session.