Tidy Regressions in R

Up to this point we’ve emphasized the virtues of functional programming, tidy data wrangling and tidy visualization. Today we discuss how to do simple regression analyses within a tidy framework. We also explore Johann Pfitzinger’s package tidyfit.

First: a note on regressions in R

Let’s look at how regressions (and most modelling commands) are specified and executed in R.

R’s syntax for modelling typically is as follows:

Dependent Var ~ Independent Vars

Or:

Y ~ X + Z

There are a few other symbols worth noting:

Symbol   Example                  Meaning
.        Y ~ .                    Include all variables
+ / -    Y ~ X + Z or Y ~ . - Z   Add or remove variables
:        Y ~ X + Z + X:Z          Include the interaction
*        Y ~ X*Z                  Include the variables individually plus their interaction
-1       Y ~ X + Z - 1            Remove the intercept
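
To make these shorthands concrete, here are a few toy fits on the iris data (which we also use below); the choice of variables is purely illustrative:

# Include all variables except Species (the '.' expands to every other column):
lm(Sepal.Length ~ . - Species, data = iris)

# Main effects plus their interaction (X*Z is shorthand for X + Z + X:Z):
lm(Sepal.Length ~ Sepal.Width * Petal.Width, data = iris)

# Remove the intercept:
lm(Sepal.Length ~ Sepal.Width + Petal.Width - 1, data = iris)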

There are alternatives to lm too, such as glm (see ?stats::glm) and rlm from the MASS package (see ?MASS::rlm).
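
Both follow the same formula interface; for example (purely illustrative fits on built-in data):

# A logistic regression via glm:
glm(am ~ mpg + wt, data = mtcars, family = binomial)

# A robust regression via MASS::rlm:
MASS::rlm(Sepal.Length ~ Sepal.Width + Petal.Width, data = iris)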

Here is how to estimate a simple linear regression with lm:

reg <- lm(data = iris, Sepal.Length ~ Sepal.Width + Petal.Width)
names(reg)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
summary(reg)
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Width, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2076 -0.2288 -0.0450  0.2266  1.1810 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.45733    0.30919   11.18  < 2e-16 ***
## Sepal.Width  0.39907    0.09111    4.38 2.24e-05 ***
## Petal.Width  0.97213    0.05210   18.66  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4511 on 147 degrees of freedom
## Multiple R-squared:  0.7072, Adjusted R-squared:  0.7033 
## F-statistic: 177.6 on 2 and 147 DF,  p-value: < 2.2e-16
coefs <- reg$coefficients

Now you are welcome to put the above regression into a nice table yourself, but I would definitely recommend using the package broom to tidy up your results:

pacman::p_load("broom")
library(tidyverse)
tidy(reg)
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    3.46     0.309      11.2  2.17e-21
## 2 Sepal.Width    0.399    0.0911      4.38 2.24e- 5
## 3 Petal.Width    0.972    0.0521     18.7  1.34e-40

To pull model-level statistics, such as the R-squared values, use glance:

Rsqd <- broom::glance(reg) %>%
    select(contains("r.squared"))
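
broom also provides augment, in case you want the fitted values and residuals attached to each observation in the same tidy style (a small aside, not needed for what follows):

# broom::augment() returns the data with fitted values and residuals attached
fitted_df <- broom::augment(reg)
head(fitted_df)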

Next, we will use the package tidyfit to analyze factor loadings of a universe of funds (e.g. to identify style tilts for individual funds). The primary purpose is to illustrate the usage and advantages of the tidyfit package in R and how it can help us:

  1. to streamline regression workflows, making our analysis much more efficient, and
  2. to access a large number of regression and classification methods through a standardized interface.

To install tidyfit, run the following code chunk:

# devtools::install_github('jpfitzinger/tidyfit')
library(tidyfit)

If you’d like to do some background reading, have a look at the website. If you run into problems or find bugs in the package please don’t hesitate to contact Johann directly.

Some additional packages are necessary for the models we will fit (tidyfit does not install all model-specific dependencies upfront). You will be prompted along the way to install bestglm, shrinkTVP and arm.

Let’s begin by loading some data.

Load data

# Run the following to update the used data....
# devtools::install_github('Nicktz/fmxdat')
# Fund & factor returns
data_raw <- fmxdat::asisa %>%
    select(date, Funds, Returns) %>%
    left_join(fmxdat::Local_Factors, by = "date") %>%
    select(-SWIX, -ALSI, -Multifactor)

data_raw
## # A tibble: 54,000 × 9
##    date       Funds   Returns Capped SWI…¹   Yield  Value Momen…² Quality   Size
##    <date>     <chr>     <dbl>        <dbl>   <dbl>  <dbl>   <dbl>   <dbl>  <dbl>
##  1 2002-07-31 Fund_1  NA            -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
##  2 2002-07-31 Fund_2  NA            -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
##  3 2002-07-31 Fund_3  NA            -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
##  4 2002-07-31 Fund_4  NA            -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
##  5 2002-07-31 Fund_5  NA            -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
##  6 2002-07-31 Fund_6  NA            -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
##  7 2002-07-31 Fund_7  NA            -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
##  8 2002-07-31 Fund_8  NA            -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
##  9 2002-07-31 Fund_9  -0.0874       -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## 10 2002-07-31 Fund_10 -0.0990       -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## # … with 53,990 more rows, and abbreviated variable names ¹​`Capped SWIX`,
## #   ²​Momentum

We’ve got monthly fund returns plus six different factors. Note that Capped SWIX is the market factor (i.e. its coefficient will give us the market beta). We want to explain each fund’s returns using this market factor together with the other style factors (the Yield, Value, Momentum, Quality and Size columns).

Next we will add some other external variables:

# Some external data
df_cmdty <- fmxdat::Commodities %>%
    select(date, Name, returns) %>%
    spread(Name, returns)
df_usd <- fmxdat::USD_Strength %>%
    spread(Name, returns)

# Join
data <- data_raw %>%
    left_join(., df_cmdty, by = "date") %>%
    left_join(., df_usd, by = "date")

# Handling (i.e. dropping in our case) missing data
data <- data %>%
    filter(!is.na(Returns)) %>%
    filter(!is.na(BBDXY))

# Build a quick check to see if there are any NAs in a
# column:

NA_Checker_Function <- function(x) {
    sum(is.na(x))
}

CheckforNA <- data %>%
    apply(., 2, NA_Checker_Function)
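
As an aside, the same column-wise NA count can be written more compactly (both of these are equivalent to the apply call above):

# Count NAs per column in two shorter, equivalent ways
colSums(is.na(data))
data %>%
    summarise(across(everything(), ~ sum(is.na(.x))))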

Explore data

Before we begin modeling, we need to take a closer look at the sample. The minimum sample size across dates is 50 funds:

# Plot number of funds by date
data %>%
    group_by(date) %>%
    summarise(`# of funds` = n()) %>%
    ggplot(aes(date, `# of funds`)) + geom_line() + theme_bw()

… however, some funds have a very short history.

# Plot number of months by fund
data %>%
    group_by(Funds) %>%
    summarise(`# of dates` = n()) %>%
    arrange(`# of dates`) %>%
    head(20)
## # A tibble: 20 × 2
##    Funds    `# of dates`
##    <chr>           <int>
##  1 Fund_117            2
##  2 Fund_184            2
##  3 Fund_28             7
##  4 Fund_120            8
##  5 Fund_128            8
##  6 Fund_94             8
##  7 Fund_154            9
##  8 Fund_169            9
##  9 Fund_103           10
## 10 Fund_140           10
## 11 Fund_6             10
## 12 Fund_7             10
## 13 Fund_8             10
## 14 Fund_190           11
## 15 Fund_189           12
## 16 Fund_63            12
## 17 Fund_178           13
## 18 Fund_191           13
## 19 Fund_199           13
## 20 Fund_225           13

We will drop any funds with four years (48 months) of data or less, as well as funds that are no longer active. tidyfit does include many regularized regression techniques that can handle very small sample sizes, but this will simplify things a little later on:

data <- data %>%
    group_by(Funds) %>%
    filter(n() > 48, max(date) == max(data$date)) %>%
    ungroup()

This leaves us with a data set of exactly 100 funds. Now we are ready to start modeling!
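
If you want to verify this, a quick count of the remaining funds does the trick:

# Sanity check: number of distinct funds left after filtering
data %>%
    summarise(n_funds = n_distinct(Funds))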

Fitting linear regression

tidyfit has a number of nice features. Have a look at the repo and the website for more complete documentation. One feature is that we can fit regressions on grouped tibbles: in this case, a regression is fitted for each group in the data. The syntax is extremely simple:

models <- 
  data %>% 
  group_by(Funds) %>%   # So we estimate a model for each fund
  regress(Returns ~ ., m("lm"), .mask = "date")
models
## # A tibble: 100 × 7
##    Funds    model estimator_fct `size (MB)` grid_id  model_object settings
##    <chr>    <chr> <chr>               <dbl> <chr>    <list>       <list>  
##  1 Fund_1   lm    stats::lm           0.263 #0010000 <tidyFit>    <tibble>
##  2 Fund_10  lm    stats::lm           0.290 #0010000 <tidyFit>    <tibble>
##  3 Fund_101 lm    stats::lm           0.261 #0010000 <tidyFit>    <tibble>
##  4 Fund_107 lm    stats::lm           0.290 #0010000 <tidyFit>    <tibble>
##  5 Fund_108 lm    stats::lm           0.290 #0010000 <tidyFit>    <tibble>
##  6 Fund_109 lm    stats::lm           0.249 #0010000 <tidyFit>    <tibble>
##  7 Fund_111 lm    stats::lm           0.288 #0010000 <tidyFit>    <tibble>
##  8 Fund_112 lm    stats::lm           0.290 #0010000 <tidyFit>    <tibble>
##  9 Fund_114 lm    stats::lm           0.250 #0010000 <tidyFit>    <tibble>
## 10 Fund_115 lm    stats::lm           0.270 #0010000 <tidyFit>    <tibble>
## # … with 90 more rows

regress takes (1) a data argument (piped in this case), (2) a formula (regression of Returns on everything else), and (3) a model wrapper m(). There are also some additional arguments like .mask that help us set things up.

The model wrapper stipulates the method used and can take any additional arguments to pass to that method. For instance, here we are fitting OLS regressions using lm(). To see which methods are available, check ?m(). Note, however, that you need some familiarity with the underlying package in order to know what arguments you can pass. To see what the underlying method is, and for more details about hyperparameters etc., see ?.model.<method> (e.g. ?.model.lm).
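
For example, straight from the console (the backward-selection arguments shown in the comment below are used again in the next section):

# Which methods does m() accept, and what does the "lm" method wrap?
?m
?.model.lm

# Method-specific arguments are passed straight through the wrapper, e.g.:
# m("subset", method = "backward", IC = "AIC")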

The models object contains fitted models for each fund. In our case there should be a total of 100 fund-specific models in the model frame.

We are interested in the loadings, which are accessible with coef:

factor_loadings <- coef(models)
factor_loadings
## # A tibble: 1,200 × 5
## # Groups:   Funds, model [100]
##    Funds  model term        estimate model_info      
##    <chr>  <chr> <chr>          <dbl> <list>          
##  1 Fund_1 lm    (Intercept)  0.00175 <tibble [1 × 3]>
##  2 Fund_1 lm    Capped SWIX  0.372   <tibble [1 × 3]>
##  3 Fund_1 lm    Yield       -0.119   <tibble [1 × 3]>
##  4 Fund_1 lm    Value       -0.112   <tibble [1 × 3]>
##  5 Fund_1 lm    Momentum     0.118   <tibble [1 × 3]>
##  6 Fund_1 lm    Quality      0.0202  <tibble [1 × 3]>
##  7 Fund_1 lm    Size         0.404   <tibble [1 × 3]>
##  8 Fund_1 lm    Bcom_Index  -0.0297  <tibble [1 × 3]>
##  9 Fund_1 lm    Gold         0.0770  <tibble [1 × 3]>
## 10 Fund_1 lm    Oil_Brent    0.0224  <tibble [1 × 3]>
## # … with 1,190 more rows

Suppose we want to identify some value funds with commodities exposure. Here’s a plot — we’re looking for the funds with positive value loadings and a positive coefficient on the Bcom_Index:

factor_loadings %>%
    select(Funds, term, estimate) %>%
    spread(term, estimate) %>%
    ggplot(aes(Value, Bcom_Index)) + geom_point() + geom_hline(yintercept = 0) +
    geom_vline(xintercept = 0) + geom_rect(aes(xmin = 0, xmax = max(Value) +
    0.02, ymin = 0, ymax = max(Bcom_Index) + 0.02), color = "red",
    alpha = 0) + theme_bw()

A problem is that linear regression is often spurious or imprecise (particularly when T is small). Let’s apply Occam’s Razor here and select sparse models using a backward selection algorithm (i.e. if a factor does not improve the model, it is dropped).

Backward selection

# install.packages('bestglm')

To perform a backward selection, we use the “subset” method in m(). This method can do forward selection, backward selection and exhaustive search. It wraps bestglm, which builds on leaps (the R standard for these algorithms) with some additional helpful extensions. Here is where the power of tidyfit kicks in, since only a slight adjustment to the code generates the new models:

models_subset <- data %>%
    group_by(Funds) %>%
    regress(Returns ~ ., m("subset", method = "backward", IC = "AIC"),
        .mask = "date")
models_subset
## # A tibble: 100 × 8
##    Funds    model  estimator_fct    size (M…¹ grid_id model_o…² settings warni…³
##    <chr>    <chr>  <chr>                <dbl> <chr>   <list>    <list>   <chr>  
##  1 Fund_1   subset bestglm::bestglm    0.0533 #00100… <tidyFit> <tibble> <NA>   
##  2 Fund_10  subset bestglm::bestglm    0.0814 #00100… <tidyFit> <tibble> <NA>   
##  3 Fund_101 subset bestglm::bestglm    0.0486 #00100… <tidyFit> <tibble> <NA>   
##  4 Fund_107 subset bestglm::bestglm    0.0903 #00100… <tidyFit> <tibble> model …
##  5 Fund_108 subset bestglm::bestglm    0.0903 #00100… <tidyFit> <tibble> <NA>   
##  6 Fund_109 subset bestglm::bestglm    0.0491 #00100… <tidyFit> <tibble> <NA>   
##  7 Fund_111 subset bestglm::bestglm    0.0839 #00100… <tidyFit> <tibble> model …
##  8 Fund_112 subset bestglm::bestglm    0.0815 #00100… <tidyFit> <tibble> <NA>   
##  9 Fund_114 subset bestglm::bestglm    0.0423 #00100… <tidyFit> <tibble> <NA>   
## 10 Fund_115 subset bestglm::bestglm    0.0628 #00100… <tidyFit> <tibble> <NA>   
## # … with 90 more rows, and abbreviated variable names ¹​`size (MB)`,
## #   ²​model_object, ³​warnings

Once again we fetch the coefficients and plot:

factor_loadings_subset <- coef(models_subset)
factor_loadings_subset %>%
    select(Funds, term, estimate) %>%
    spread(term, estimate) %>%
    drop_na(Value, Bcom_Index) %>%
    ggplot(aes(Value, Bcom_Index)) + geom_point() + geom_hline(yintercept = 0) +
    geom_vline(xintercept = 0) + geom_rect(aes(xmin = 0, xmax = max(Value) +
    0.02, ymin = 0, ymax = max(Bcom_Index) + 0.02), color = "red",
    alpha = 0) + theme_bw()

Lo and behold, we have only one fund in our quadrant of interest, and very few with robust loadings on value and commodities. Let’s see which fund this is:

factor_loadings_subset %>%
    select(Funds, term, estimate) %>%
    spread(term, estimate) %>%
    filter(Value > 0, Bcom_Index > 0)
## # A tibble: 1 × 14
## # Groups:   Funds, model [1]
##   model  Funds   (Intercep…¹ BBDXY Bcom_…² Cappe…³    Gold Momen…⁴ Oil_B…⁵  Plat
##   <chr>  <chr>         <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>
## 1 subset Fund_50  -0.0000791    NA  0.0641   0.774 -0.0359      NA      NA    NA
## # … with 4 more variables: Quality <dbl>, Size <dbl>, Value <dbl>, Yield <dbl>,
## #   and abbreviated variable names ¹​`(Intercept)`, ²​Bcom_Index, ³​`Capped SWIX`,
## #   ⁴​Momentum, ⁵​Oil_Brent

How stable are the loadings?

# install.packages('shrinkTVP')

Now we may be interested in exploring style drift in this fund. Let’s estimate the loadings for this specific fund in a time-varying parameter framework. Again, tidyfit helps us out here:

# Notice how we no longer need .mask='date', because we are
# passing the arg index_col='date'. This ensures that for
# time-varying parameter estimation, the result of coef()
# contains a date index.
if (!require("shrinkTVP")) install.packages("shrinkTVP")

models_tvp <- data %>%
    filter(Funds == "Fund_50") %>%
    regress(Returns ~ Value + `Capped SWIX` + Bcom_Index, m("tvp",
        niter = 5000, index_col = "date"))

We can fetch and plot the coefficients over time:

factor_loadings_tvp <- coef(models_tvp)
factor_loadings_tvp <- factor_loadings_tvp %>%
    unnest(model_info)
factor_loadings_tvp
## # A tibble: 604 × 7
## # Groups:   model [1]
##    model term         estimate    upper    lower posterior.sd index     
##    <chr> <chr>           <dbl>    <dbl>    <dbl>        <dbl> <date>    
##  1 tvp   (Intercept) -0.000140 0.000639 -0.00131     0.000718 2005-01-31
##  2 tvp   (Intercept) -0.000150 0.000644 -0.00138     0.000742 2005-02-28
##  3 tvp   (Intercept) -0.000161 0.000675 -0.00146     0.000773 2005-03-31
##  4 tvp   (Intercept) -0.000165 0.000670 -0.00148     0.000772 2005-05-31
##  5 tvp   (Intercept) -0.000168 0.000669 -0.00150     0.000767 2005-06-30
##  6 tvp   (Intercept) -0.000171 0.000661 -0.00152     0.000777 2005-08-31
##  7 tvp   (Intercept) -0.000174 0.000694 -0.00151     0.000793 2005-09-30
##  8 tvp   (Intercept) -0.000171 0.000676 -0.00150     0.000792 2005-10-31
##  9 tvp   (Intercept) -0.000165 0.000706 -0.00150     0.000779 2005-11-30
## 10 tvp   (Intercept) -0.000164 0.000697 -0.00155     0.000768 2006-01-31
## # … with 594 more rows
factor_loadings_tvp %>%
    ggplot(aes(index, estimate)) + geom_ribbon(aes(ymax = upper,
    ymin = lower), alpha = 0.25) + geom_line() + facet_wrap("term",
    scales = "free") + theme_bw()

The value loading has been steadily increasing over the period of the fund’s existence. Similarly, the commodities loading reflects an upward trend. The credible interval (obtained by taking the 90% quantile interval of the beta posteriors) is very wide.

Is the smaller model actually better?

# install.packages('arm')

Now, we may ask whether the backward selection is actually any good at predicting returns. Let’s use a predictive workflow to assess the out-of-sample performance of the models. Here I will introduce two additional methods into the mix. The first is a LASSO regression, which is estimated with glmnet. The LASSO also estimates sparse models, but unlike backward selection it is an embedded method (i.e. it estimates parameters and performs variable selection simultaneously and not sequentially like the subset selection algorithms). The second method is a Bayesian regression using arm.

The LASSO regression requires hyperparameter tuning. A reasonable grid for the penalty parameter is chosen automatically (but it could also be passed manually using m("lasso", lambda = <grid vector>)). However, we need to define a .cv argument to tell regress how to cross-validate the grid. tidyfit leverages the rsample package to build cross-validation samples. Here we will use a simple train/validate split on the time-ordered data: .cv = "initial_time_split".
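
For example, a hand-picked penalty grid combined with the time-ordered split could look like the sketch below (the lambda values are purely illustrative; in the chunk that follows we simply let tidyfit choose the grid):

# Illustrative manual penalty grid for the LASSO; .cv controls how it is validated
models_lasso <- data %>%
    filter(year(date) < 2021) %>%
    group_by(Funds) %>%
    regress(Returns ~ ., m("lasso", lambda = seq(1e-04, 0.01, length.out = 25)),
        .mask = "date", .cv = "initial_time_split")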

# IMPORTANT: Chunk takes a little longer to run

if (!require("bestglm")) install.packages("bestglm")
if (!require("arm")) install.packages("arm")

models <- data %>%
    filter(year(date) < 2021) %>%
    group_by(Funds) %>%   # Again, one model per fund
    regress(Returns ~ ., m("lm"), m("subset", method = "backward",
        IC = "AIC"), m("lasso"), m("bayes"), .mask = "date",
        .cv = "initial_time_split")

Next we generate out-of-sample predictions. Here we can simply use predict:

predictions <- data %>%
    filter(year(date) >= 2021) %>%
    predict(models, .)

Now, calculating the accuracy is made very easy using the yardstick package. It is important to note that the predictions tibble is already grouped appropriately. If this was not the case, you would still need to group by model and Funds.

if (!require("yardstick")) install.packages("yardstick")

acc <- predictions %>%
    yardstick::rmse(truth, prediction)
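
If the grouping had been dropped somewhere along the way, you would group explicitly before computing the metric:

# Explicit grouping by model and fund before computing the RMSE
acc_manual <- predictions %>%
    group_by(model, Funds) %>%
    yardstick::rmse(truth, prediction)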

Let’s plot the result. To make it easier to compare, I will present all metrics relative to “lm”. So a positive value is a larger error, and a negative value is a smaller error:

acc %>%
    mutate(.estimate = .estimate - .estimate[model == "lm"]) %>%
    filter(model != "lm") %>%
    ggplot(aes(model, .estimate)) + geom_hline(yintercept = 0,
    color = "red") + geom_boxplot() + theme_bw()

Interesting result: backward selection does not actually improve things at all. The Bayesian regression does a better job than OLS and the LASSO certainly does as well. I have seen this quite often: LASSO and other embedded methods are much better than sequential methods at variable selection.

Now it’s your turn…

We’ve gone through quite a lot of analysis in a very short time. Now it is your turn to get familiar with tidyfit and try out some of your own workflows. Here are some ideas:

  1. Use an ElasticNet (“enet”) to estimate factor loadings and identify some pure style funds with exposure to only one factor
  2. Perform interval return predictions with a quantile regression (m("quantile"))
  3. Forecast fund performance (positive vs negative performance, one-month-ahead) using classify
  4. Estimate variable factor exposures under positive vs. negative dollar strength using fixed and/or random effects (m("glmm")). This requires some knowledge of lme4.
  5. Estimate factor loadings with a LASSO regression, and bootstrap confidence intervals (see here for an example)
  6. … or think of your own question to drill into