In summary - the tidyverse package is a collection of various other packages, including:
dplyr : awesome for data wrangling
readr : awesome for importing data
ggplot2 : the standard for plotting in a tidy framework
tidyr : essential for manipulating dataframes
purrr : makes your code (and you) purrr with delight… you'll see. Absolute syntactic sugar. Pure elation.
lubridate : the go-to package for working with dates
Remember that a key command when using the tidyverse is the pipe, %>%.
x %>% f is equivalent to f(x)
x %>% f(y) is equivalent to f(x,y)
We can use a dot to refer to the data input: x %>% f(.) is equivalent to f(x)
Type Ctrl + Shift + M in RStudio, and you will see the pipe appear. This will be your friend from now on.
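To see the equivalences above in action, here is a tiny sketch using base R's mean() on a made-up vector (the pipe itself comes from magrittr, which the tidyverse loads for you):

library(magrittr) # or library(tidyverse)
x <- c(1, 2, NA, 4)
mean(x, na.rm = TRUE)       # classic call: f(x, y)
x %>% mean(na.rm = TRUE)    # piped equivalent
x %>% mean(., na.rm = TRUE) # the dot refers explicitly to the piped-in data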
You have already installed dplyr when you installed tidyverse, but just to be sure, execute install.packages("dplyr").
E.g., using dplyr on our previous practical’s dataset:
# Do not run:
rmsfuns::load_pkg("tidyverse")
library(rmsfuns)
load_pkg("readr")
data <- fmxdat::BRICSTRI
data %>% select(Date, zar)
Notice that zar has now been selected. Also, VERY NB: piping does not change the underlying data unless you save the result.
This is VERY useful, as it allows you to look at what the data would look like before saving it. To check this, type View(data) and see that the data frame has not changed after piping the select command.
To save a change, use <- as follows:
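For instance, a minimal sketch using the BRICSTRI data from earlier (data_subset is a hypothetical name):

# Save the piped result into a new object; `data` itself is unchanged:
data_subset <- data %>% select(Date, zar)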
So, time to reboot how you think about dataframes.
dplyr requires you to think in a tidy way about your data - in particular, that your data should be in a long format.
The thinking behind dplyr is to organize your data in such a way that it makes analysis simpler.
The idea behind tidy data can be described as follows:
Run the following code to see tidy vs non-tidy broken down in simple terms:
library(rmsfuns)
pkgstoload <- c("tidyverse","lubridate", "ggplot2")
load_pkg(pkgstoload)
datawide <- fmxdat::Wide_Example
head(datawide)
## # A tibble: 6 × 15
## date SubFactors AAA SJ…¹ BBB S…² CCC S…³ DDD S…⁴ EEE S…⁵ FFF S…⁶ TTT S…⁷
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2002-01-31 EBIT.EV 15.8 7.78 14.8 13.6 12.5 10.4 -1.23
## 2 2002-01-31 FCF.EV 10.5 6.78 12.3 16.0 13.5 12.0 -1.47
## 3 2002-01-31 ROA 13.4 15.7 12.6 9.71 10.5 8.55 -1.42
## 4 2002-01-31 ROE 3.22 15.9 11.9 10.9 11.9 14.3 0.499
## 5 2002-01-31 TRR6M1M 13.4 14.4 9.76 24.6 17.4 18.4 0.482
## 6 2002-01-31 Volat.D.60 11.4 9.82 12.5 13.0 15.6 21.3 -0.138
## # … with 6 more variables: `UUU SPGLOB Equity` <dbl>,
## # `VVV SPGLOB Equity` <dbl>, `XXX SPGLOB Equity` <dbl>,
## # `YYY SPGLOB Equity` <dbl>, `ZZZ SPGLOB Equity` <dbl>, AnotherColumn <chr>,
## # and abbreviated variable names ¹`AAA SJ Equity`, ²`BBB SJ Equity`,
## # ³`CCC SJ Equity`, ⁴`DDD SJ Equity`, ⁵`EEE SJ Equity`, ⁶`FFF SJ Equity`,
## # ⁷`TTT SPGLOB Equity`
Notice that in the df above, all the Tickers have their own columns. But as mentioned, by the definition of tidiness, every column must be a variable. As the wide columns are all alike (they are all Tickers), they belong in a single column called Tickers!
So we need to make the wide data tidy by gathering all the ticker columns into a single column. I specifically mutated two random columns to show you how to gather only the Ticker columns…
data_tidy <-
datawide %>% gather(key = Tickers, value = Scores, contains(" Equity") )
head(data_tidy)
## # A tibble: 6 × 5
## date SubFactors AnotherColumn Tickers Scores
## <date> <chr> <chr> <chr> <dbl>
## 1 2002-01-31 EBIT.EV Random AAA SJ Equity 15.8
## 2 2002-01-31 FCF.EV Random AAA SJ Equity 10.5
## 3 2002-01-31 ROA Random AAA SJ Equity 13.4
## 4 2002-01-31 ROE Random AAA SJ Equity 3.22
## 5 2002-01-31 TRR6M1M Random AAA SJ Equity 13.4
## 6 2002-01-31 Volat.D.60 Random AAA SJ Equity 11.4
Note how easy that was if you know the three inputs above:
key is the name of the new column that distinguishes the observations (here, Tickers)
value is the name of the new column holding the observation values (here, Scores)
third, the columns to gather. Note I used contains(" Equity"), as all the Ticker columns end with Equity. Selecting columns this way is useful; other helpers include ends_with(), one_of(…), etc. (see the cheatsheet!)
You can also pass in vectors to gather or select by - test this yourself.
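For example, here is a sketch of gathering by a character vector of column names instead (ticker_cols is a hypothetical name; all_of() is the tidyselect way to use such a vector):

# Build a character vector of the Ticker columns, then gather by it:
ticker_cols <- names(datawide)[grepl(" Equity$", names(datawide))]
data_tidy_vec <- datawide %>%
  gather(key = Tickers, value = Scores, all_of(ticker_cols))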
Let’s first create a universe distinction between SA and US determined by the SJ Equity or SPGlob part under Tickers:
data_tidy <-
data_tidy %>%
mutate(Universe = ifelse(grepl(" SJ Equity", Tickers), "SA",
ifelse(grepl(" SPGLOB Equity", Tickers), "SPGLOB", NA)))
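As an aside (not part of the original notes), nested ifelse calls like the above can be written with dplyr's case_when, which reads more naturally as conditions multiply:

# Equivalent sketch using case_when:
data_tidy <- data_tidy %>%
  mutate(Universe = case_when(
    grepl(" SJ Equity", Tickers) ~ "SA",
    grepl(" SPGLOB Equity", Tickers) ~ "SPGLOB",
    TRUE ~ NA_character_))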
If we now want to, e.g., standardize the Scores (calculate Z-scores) within each date, Universe and SubFactor grouping, we simply pipe it in dplyr (don't forget the dplyr cheatsheet):
data_tidy %>%
group_by(date, Universe, SubFactors) %>%
filter(!is.na(Scores)) %>% # Keep only valid Scores
mutate( ZScores = (Scores - mean(Scores, na.rm = TRUE)) /
sd(Scores, na.rm = TRUE)) %>% ungroup() # ZScores column created
## # A tibble: 144 × 7
## date SubFactors AnotherColumn Tickers Scores Universe ZScores
## <date> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 2002-01-31 EBIT.EV Random AAA SJ Equity 15.8 SA 1.11
## 2 2002-01-31 FCF.EV Random AAA SJ Equity 10.5 SA -0.430
## 3 2002-01-31 ROA Random AAA SJ Equity 13.4 SA 0.627
## 4 2002-01-31 ROE Random AAA SJ Equity 3.22 SA -1.85
## 5 2002-01-31 TRR6M1M Random AAA SJ Equity 13.4 SA -0.574
## 6 2002-01-31 Volat.D.60 Random AAA SJ Equity 11.4 SA -0.616
## 7 2002-02-28 EBIT.EV Random AAA SJ Equity 5.83 SA -0.834
## 8 2002-02-28 FCF.EV Random AAA SJ Equity 11.6 SA 0.655
## 9 2002-02-28 ROA Random AAA SJ Equity 4.32 SA -1.84
## 10 2002-02-28 ROE Random AAA SJ Equity 15.0 SA 0.409
## # … with 134 more rows
Notice that a ZScores column has now been added to our dataframe with minimal effort. What you need to get right, though (and focus on), is grouping correctly and mutating accurately. Tip: use ViewXL to check your calculation in Excel if you are uncertain.
Let’s look at our data:
## # A tibble: 6 × 6
## date SubFactors AnotherColumn Tickers Scores Universe
## <date> <chr> <chr> <chr> <dbl> <chr>
## 1 2002-01-31 EBIT.EV Random AAA SJ Equity 15.8 SA
## 2 2002-01-31 FCF.EV Random AAA SJ Equity 10.5 SA
## 3 2002-01-31 ROA Random AAA SJ Equity 13.4 SA
## 4 2002-01-31 ROE Random AAA SJ Equity 3.22 SA
## 5 2002-01-31 TRR6M1M Random AAA SJ Equity 13.4 SA
## 6 2002-01-31 Volat.D.60 Random AAA SJ Equity 11.4 SA
Next, consider a wide dataframe of daily share prices, dailydata, where each variable (Ticker) has its own column - by the definition we saw earlier, this is not tidy.
Tidy_dailydata <-
dailydata %>% gather(Ticker, Prices, -Date)
# Translated to English: gather all columns except Date, calling the key column Ticker and the corresponding values Prices.
# Check it out:
head(Tidy_dailydata)
## # A tibble: 6 × 3
## Date Ticker Prices
## <date> <chr> <dbl>
## 1 2003-01-02 ABI SJ 4880
## 2 2003-01-03 ABI SJ 4880
## 3 2003-01-06 ABI SJ 4880
## 4 2003-01-07 ABI SJ 4925
## 5 2003-01-08 ABI SJ 4881
## 6 2003-01-09 ABI SJ 4881
So what is so amazing about it now being tidy?
Tidy_dailydata %>%
group_by(Ticker) %>% # Effectively you are looping across Tickers
mutate(Returns = Prices / lag(Prices) - 1) # You are now creating a column called Returns...
## # A tibble: 36,573 × 4
## # Groups: Ticker [73]
## Date Ticker Prices Returns
## <date> <chr> <dbl> <dbl>
## 1 2003-01-02 ABI SJ 4880 NA
## 2 2003-01-03 ABI SJ 4880 0
## 3 2003-01-06 ABI SJ 4880 0
## 4 2003-01-07 ABI SJ 4925 0.00922
## 5 2003-01-08 ABI SJ 4881 -0.00893
## 6 2003-01-09 ABI SJ 4881 0
## 7 2003-01-10 ABI SJ 4900 0.00389
## 8 2003-01-13 ABI SJ 4950 0.0102
## 9 2003-01-14 ABI SJ 4980 0.00606
## 10 2003-01-15 ABI SJ 4910 -0.0141
## # … with 36,563 more rows
Notice the use above of the all-powerful group_by call… Also, mutate adds a column.
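To see why group_by matters here, consider a sketch of what NOT to do:

# Without group_by, lag() runs down the entire stacked Prices column, so each
# Ticker's first 'return' would be computed against the previous Ticker's last price:
Tidy_dailydata %>%
  mutate(Returns = Prices / lag(Prices) - 1) # don't do this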
Let’s check our data and remove Tickers with only NA for prices for a given month:
Tidy_dailydata_Adjusted_No_Full_NA_Per_Month <-
Tidy_dailydata %>%
mutate(Year_Month = format(Date, "%Y%B")) %>%
group_by(Year_Month, Ticker) %>%
mutate(N_Obs = n(), N_NA = sum(is.na(Prices)) ) %>% # Count the number of dates in the month (N_Obs) and the number of NAs per month per Ticker (N_NA)
filter( N_Obs != N_NA) %>% ungroup()
print(Tidy_dailydata)
## # A tibble: 36,573 × 3
## Date Ticker Prices
## <date> <chr> <dbl>
## 1 2003-01-02 ABI SJ 4880
## 2 2003-01-03 ABI SJ 4880
## 3 2003-01-06 ABI SJ 4880
## 4 2003-01-07 ABI SJ 4925
## 5 2003-01-08 ABI SJ 4881
## 6 2003-01-09 ABI SJ 4881
## 7 2003-01-10 ABI SJ 4900
## 8 2003-01-13 ABI SJ 4950
## 9 2003-01-14 ABI SJ 4980
## 10 2003-01-15 ABI SJ 4910
## # … with 36,563 more rows
Notice the use of filter above. We were able to filter out stocks for months where all of that stock's values were NA. That's a lot happening in only a few lines!
Also notice that we should ALWAYS ALWAYS ALWAYS ungroup() at the end of a pipe. This stops us from accidentally carrying grouping information downstream unintentionally…
Let's now create a few monthly statistics, in particular the following:
* Monthly Returns
* Monthly Minimum and maximum daily returns
# To achieve the above, we need to now do the following:
# Step one is to calculate the daily returns, and then calculate the Monthly minimum and maximum for each stock.
# Only then do we calculate the monthly returns for each stock:
Tidy_dailydata_Adjusted_No_Full_NA_Per_Month %>%
mutate(Year_Month = format(Date, "%Y%B")) %>%
arrange(Date) %>% # Make sure your date column is arranged from earliest to most recent...
group_by(Ticker) %>%
mutate(DailyReturn = Prices / lag(Prices) - 1) %>%
group_by( Year_Month, Ticker) %>%
mutate( across(DailyReturn, .fns = list( Min = ~min(., na.rm = TRUE), Max = ~max(., na.rm = TRUE), Mean = ~mean(., na.rm = TRUE)) ) ) %>%
filter(Date == last(Date)) %>%
group_by(Ticker) %>% # Careful - if you kept group_by(Year_Month, Ticker), it would have led to only NAs, as each (Month, Ticker) pair now has only one value - we need to group_by Ticker only...
mutate(Monthly_Returns = Prices / lag(Prices) - 1) %>% arrange(Ticker) %>%
ungroup() %>%
select(Date, Ticker, DailyReturn:Monthly_Returns)
## # A tibble: 830 × 7
## Date Ticker DailyReturn DailyReturn_Min DailyReturn_…¹ DailyR…² Month…³
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2003-01-31 ABI SJ -0.000213 -0.0244 0.0102 -1.76e-3 NA
## 2 2003-02-28 ABI SJ 0.0265 -0.0408 0.0265 1.68e-3 0.0319
## 3 2003-03-31 ABI SJ -0.0215 -0.0396 0.0269 -3.07e-3 -0.0619
## 4 2003-04-30 ABI SJ 0.0314 -0.0213 0.0314 3.99e-3 0.0769
## 5 2003-05-30 ABI SJ 0 -0.00786 0.0398 1.95e-3 0.0408
## 6 2003-06-30 ABI SJ 0.00368 -0.0265 0.0378 3.44e-3 0.0686
## 7 2003-07-31 ABI SJ -0.0128 -0.0191 0.0493 2.61e-3 0.0596
## 8 2003-08-29 ABI SJ 0 -0.0185 0.0188 1.87e-3 0.0390
## 9 2003-09-30 ABI SJ 0 -0.0207 0.0290 4.74e-5 0
## 10 2003-10-31 ABI SJ -0.00952 -0.0455 0.0427 1.86e-3 0.0400
## # … with 820 more rows, and abbreviated variable names ¹DailyReturn_Max,
## # ²DailyReturn_Mean, ³Monthly_Returns
Notice above the use of across. This is powerful, as it allows the application of many functions, within a group_by call, to many columns - let the power of that sink in…
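As a minimal standalone sketch of across (reusing the Tidy_dailydata object from earlier):

# Apply one function to every numeric column, per Ticker:
Tidy_dailydata %>%
  group_by(Ticker) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))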
Notice too our use of select. select can also be used in several other forms, e.g.:
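# A few common select() forms (a sketch; see the dplyr cheatsheet):
Tidy_dailydata %>% select(Date:Prices)       # a range of adjacent columns
Tidy_dailydata %>% select(-Ticker)           # drop a column
Tidy_dailydata %>% select(starts_with("Pr")) # tidyselect helper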
Notice in the following function the use of summarise. Summarise collapses the given columns by the grouping specified:
SD_df <-
Tidy_dailydata_Adjusted_No_Full_NA_Per_Month %>%
mutate(Year_Month = format(Date, "%Y%B")) %>%
arrange(Date) %>% # Make sure your date column is arranged from earliest to most recent...
group_by(Ticker) %>%
mutate(DailyReturn = Prices / lag(Prices) - 1) %>%
group_by(Year_Month, Ticker) %>%
# Calculate each stock's Sharpe Ratio:
summarise(Sharpe = mean(DailyReturn, na.rm = TRUE) / sd(DailyReturn, na.rm = TRUE)) %>% ungroup()
# Notice that we have now collapsed our dataframe using summarise, which does exactly what it says: for each Ticker and month, we summarise the daily returns into a single number (mean over standard deviation). Read this sentence again and make sure you understand the intuition behind summarise.
print(SD_df)
## # A tibble: 830 × 3
## Year_Month Ticker Sharpe
## <chr> <chr> <dbl>
## 1 2003April ABI SJ 0.305
## 2 2003April ACL SJ -0.112
## 3 2003April AGL SJ -0.322
## 4 2003April AMS SJ 0.267
## 5 2003April ANG SJ 0.165
## 6 2003April AOD SJ -0.270
## 7 2003April AVG SJ -0.274
## 8 2003April BAW SJ -0.0345
## 9 2003April BGA SJ 0.272
## 10 2003April BIL SJ -0.231
## # … with 820 more rows
To really appreciate the full power of dplyr, let’s pull in a more advanced dataset.
The following dataset contains the daily prices of SA, SP500 and NKY stocks from 2017.
Let’s play:
Let’s look at our dataframe:
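The chunk that loads this dataset is not shown here; it is assumed below to have been assigned to an object called Daily_TRI (the name used in the later code):

# Daily_TRI is assumed to exist already - the load step is elided in these notes:
head(Daily_TRI)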
## # A tibble: 6 × 8
## date Tickers TRI Short.Name Marke…¹ BICS_…² BICS_…³ Unive…⁴
## <date> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 2017-01-02 ACT SJ Equity 1.17 AFROCENTRIC IN… 2.83e8 Health… Health… JALSHA…
## 2 2017-01-03 ACT SJ Equity 1.13 AFROCENTRIC IN… 2.83e8 Health… Health… JALSHA…
## 3 2017-01-04 ACT SJ Equity 1.18 AFROCENTRIC IN… 2.83e8 Health… Health… JALSHA…
## 4 2017-01-05 ACT SJ Equity 1.18 AFROCENTRIC IN… 2.83e8 Health… Health… JALSHA…
## 5 2017-01-06 ACT SJ Equity 1.17 AFROCENTRIC IN… 2.83e8 Health… Health… JALSHA…
## 6 2017-01-09 ACT SJ Equity 1.24 AFROCENTRIC IN… 2.83e8 Health… Health… JALSHA…
## # … with abbreviated variable names ¹Market.Cap, ²BICS_LEVEL_1_SECTOR_NAME,
## # ³BICS_LEVEL_2_INDUSTRY_GROUP_NAME, ⁴Universe
So, we have the market cap, sector and subsector for each country (NKY, SP500 and JALSH).
Let's calculate the value of $1 put into each of the DAIKIN, SUZUKI and KAWASAKI shares since 2017:
Step 1: Find Ticker names for each stock:
# Below is a quick way of pulling and looking at the unique Short.Names to find the stocks' Short.Names:
Daily_TRI %>% pull(Short.Name ) %>% unique()
Notice that pull takes a column and transforms it into a vector. This is at times extremely useful, as it can then e.g. be used as an input to another function.
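For instance, the pulled vector can feed straight into a filter (a sketch; jp_names is a hypothetical name):

jp_names <- Daily_TRI %>%
  filter(Universe == "NKY") %>%
  pull(Short.Name) %>%
  unique()
Daily_TRI %>% filter(Short.Name %in% jp_names)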
Daily_TRI_Three <-
Daily_TRI %>%
filter(grepl("DAIKIN|KAWASAKI|SUZUKI", Short.Name)) %>%
group_by(Tickers) %>% mutate(Returns = TRI/lag(TRI) - 1) %>% ungroup()
Note below the use of coalesce: coalesce(Column, 0) - this can be interpreted as saying that NAs in Column should be replaced by 0.
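In isolation, coalesce works like this (a minimal sketch):

x <- c(1, NA, 3)
coalesce(x, 0) # returns: 1 0 3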
Daily_TRI_Three %>%
mutate(Returns = coalesce(Returns, 0)) %>% # Override Return column NAs...
group_by(Tickers) %>%
mutate(Cum_Return = cumprod( 1 + Returns))
## # A tibble: 849 × 10
## # Groups: Tickers [3]
## date Tickers TRI Short…¹ Marke…² BICS_…³ BICS_…⁴ Unive…⁵ Returns
## <date> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 2017-01-02 6367 JP Eq… 108. DAIKIN… 2.91e10 Indust… Electr… NKY 0
## 2 2017-01-03 6367 JP Eq… 108. DAIKIN… 2.91e10 Indust… Electr… NKY 0
## 3 2017-01-04 6367 JP Eq… 111. DAIKIN… 2.91e10 Indust… Electr… NKY 0.0326
## 4 2017-01-05 6367 JP Eq… 112. DAIKIN… 2.91e10 Indust… Electr… NKY 0.00886
## 5 2017-01-06 6367 JP Eq… 111. DAIKIN… 2.91e10 Indust… Electr… NKY -0.00856
## 6 2017-01-09 6367 JP Eq… 111. DAIKIN… 2.91e10 Indust… Electr… NKY 0
## 7 2017-01-10 6367 JP Eq… 113. DAIKIN… 2.91e10 Indust… Electr… NKY 0.0172
## 8 2017-01-11 6367 JP Eq… 113. DAIKIN… 2.91e10 Indust… Electr… NKY -0.00408
## 9 2017-01-12 6367 JP Eq… 115. DAIKIN… 2.91e10 Indust… Electr… NKY 0.0195
## 10 2017-01-13 6367 JP Eq… 115. DAIKIN… 2.91e10 Indust… Electr… NKY 0.00118
## # … with 839 more rows, 1 more variable: Cum_Return <dbl>, and abbreviated
## # variable names ¹Short.Name, ²Market.Cap, ³BICS_LEVEL_1_SECTOR_NAME,
## # ⁴BICS_LEVEL_2_INDUSTRY_GROUP_NAME, ⁵Universe
Note how we can also include tables in our html output:
Daily_TRI_Three %>%
arrange(date) %>%
group_by(Tickers) %>%
mutate(Return = TRI / lag(TRI) - 1) %>%
mutate( Return = coalesce(Return, 0 )) %>%
mutate( CumProd = cumprod(1+Return) ) %>%
filter(date == last(date)) %>% knitr::kable()
| date | Tickers | TRI | Short.Name | Market.Cap | BICS_LEVEL_1_SECTOR_NAME | BICS_LEVEL_2_INDUSTRY_GROUP_NAME | Universe | Returns | Return | CumProd |
|---|---|---|---|---|---|---|---|---|---|---|
| 2018-01-31 | 6367 JP Equity | 141.68359 | DAIKIN INDS | 35098091520 | Industrials | Electrical Equipment | NKY | -0.0194480 | -0.0194480 | 1.316599 |
| 2018-01-31 | 7269 JP Equity | 64.45889 | SUZUKI MOTOR | 27963246592 | Consumer Discretionary | Automotive | NKY | -0.0115126 | -0.0115126 | 1.632005 |
| 2018-01-31 | 9107 JP Equity | 35.81591 | KAWASAKI KISEN | 2485033728 | Industrials | Transportation & Logistics | NKY | -0.0200886 | -0.0200886 | 1.164372 |
Use the cryptocurrency database, fmxdat::cryptos, and do the following:
Calculate the monthly return ranking of each crypto for the last N years.
Then aggregate these rankings into a single consistency ranking across all the rolling periods.
Note: Create an appropriate folder, with a descriptive README, and store the functions that do the wrangling there.
Using the following data file: fmxdat::Fin_Data_SA_US_NKY
Please note this is a small subset of the universes and so NOT indicative of actual returns!
Create an organized folder called: Index_Return_Comparison
Have a proper README that details every step of your analysis (you will send this to me on github when completed).
Then do the following:
Calculate the cap-weighted monthly index returns for every sector for each country (assume Market Cap applies to first day of the month).
Careful: some stocks might have different last days of the month - calculate monthly returns for each stock’s last day of the month.
Some stocks may have date gaps (i.e. be in index from 2005 - 2009, drop out and reenter in 2013. This will cause return distortions if Px/lag(Px)-1 is simply used.)
This means you should set returns to NA if the date gap is larger than 35 days (see the sketch after this list).
Compare the Sharpe ratios for the equally- and cap-weighted broad market indexes by country since 2017.
Calculate the JALSH index returns, by filtering the top 95% of Market Cap. (Tip: use arrange and cumsum of weights… careful with your group_bys…)
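As a hint for the date-gap point above - a sketch only, with hypothetical names (df for your wrangled dataframe, Px for the price column); adapt it to your own data:

df %>%
  arrange(Date) %>%
  group_by(Ticker) %>%
  mutate(Gap_Days = as.numeric(Date - lag(Date)),
         Return = ifelse(!is.na(Gap_Days) & Gap_Days > 35,
                         NA_real_, Px / lag(Px) - 1)) %>%
  ungroup()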