Up to this point we’ve emphasized the virtues of functional programming, tidy data wrangling and tidy visualization. Today we discuss how to do simple regression analyses within a tidy framework. We also explore Johann Pfitzinger’s package tidyfit.
First, let’s look at how regressions (and most modelling commands) are executed in R.
R’s syntax for modelling typically is as follows:
Dependent Var ~ Independent Vars
Or:
Y ~ X + Z
There are a few other symbols worth noting:
Symbol | Example | Meaning
---|---|---
`.` | `Y ~ .` | Include all variables
`+` / `-` | `Y ~ X + Z` or `Y ~ . - Z` | Add or remove variables
`:` | `Y ~ X + Z + X:Z` | Include the interaction term
`*` | `Y ~ X*Z` | Include the variables individually as well as their interaction
`-1` | `Y ~ X + Z - 1` | Remove the intercept
There are alternatives to lm too, such as glm (also in the stats package) and rlm (in the MASS package); for usage see the relevant documentation, e.g. ?stats::glm or ?MASS::rlm.
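As a quick illustration (not part of the analysis that follows), glm uses exactly the same formula interface:

# Logistic regression: transmission type (am) on weight, using the built-in mtcars data
glm(am ~ wt, data = mtcars, family = binomial)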
To show how to estimate a simple linear regression:
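The estimation chunk itself is not echoed above; judging from the output below, it amounts to something like:

# Fit an OLS regression of sepal length on sepal width and petal width (iris data)
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Width, data = iris)
names(fit)      # components stored inside the fitted lm object
summary(fit)    # the familiar regression summary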
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Width, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2076 -0.2288 -0.0450 0.2266 1.1810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.45733 0.30919 11.18 < 2e-16 ***
## Sepal.Width 0.39907 0.09111 4.38 2.24e-05 ***
## Petal.Width 0.97213 0.05210 18.66 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4511 on 147 degrees of freedom
## Multiple R-squared: 0.7072, Adjusted R-squared: 0.7033
## F-statistic: 177.6 on 2 and 147 DF, p-value: < 2.2e-16
Now you are welcome to put the above regression into a nice table yourself, but I would definitely recommend using the package broom to tidy up your results:
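The tidying chunk is likewise not echoed; using the fit object from the sketch above, it is simply:

broom::tidy(fit)   # one row per term, with estimates, std. errors, t-stats and p-values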
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.46 0.309 11.2 2.17e-21
## 2 Sepal.Width 0.399 0.0911 4.38 2.24e- 5
## 3 Petal.Width 0.972 0.0521 18.7 1.34e-40
And to pull the R-squareds, e.g., use the glance call:
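A sketch of that call, again using the hypothetical fit object from above:

broom::glance(fit)   # one-row tibble with r.squared, adj.r.squared, sigma, AIC, BIC, etc.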
Next, we will use the package tidyfit to analyze factor loadings of a universe of funds (e.g. to identify style tilts for individual funds). The primary purpose is to illustrate the usage and advantages of the tidyfit package in R and how it can help us.
To install tidyfit, run the following code chunk:
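The installation chunk is not displayed above; the CRAN release can be installed with:

install.packages("tidyfit")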
If you’d like to do some background reading, have a look at the website. If you run into problems or find bugs in the package please don’t hesitate to contact Johann directly.
Some additional packages are necessary for the models we will fit (tidyfit does not install all model-specific dependencies upfront). You will be prompted along the way to install bestglm, shrinkTVP and arm.
Let’s begin by loading some data.
# Fund & factor returns
data_raw <- fmxdat::asisa %>%
    select(date, Funds, Returns) %>%
    left_join(fmxdat::Local_Factors, by = "date") %>%
    select(-SWIX, -ALSI, -Multifactor)
data_raw
## # A tibble: 54,000 × 9
## date Funds Returns Capped SWI…¹ Yield Value Momen…² Quality Size
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2002-07-31 Fund_1 NA -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## 2 2002-07-31 Fund_2 NA -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## 3 2002-07-31 Fund_3 NA -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## 4 2002-07-31 Fund_4 NA -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## 5 2002-07-31 Fund_5 NA -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## 6 2002-07-31 Fund_6 NA -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## 7 2002-07-31 Fund_7 NA -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## 8 2002-07-31 Fund_8 NA -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## 9 2002-07-31 Fund_9 -0.0874 -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## 10 2002-07-31 Fund_10 -0.0990 -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## # … with 53,990 more rows, and abbreviated variable names ¹`Capped SWIX`,
## # ²Momentum
We’ve got monthly fund returns plus six different factors. Note that Capped SWIX is the market factor (i.e. its coefficient will give us the market beta). We want to use it, together with the other factors (the yield, value, momentum, etc. columns), as explanatory variables for each fund’s returns.
Next we will add some other external variables:
# Some external data
df_cmdty <- fmxdat::Commodities %>%
    select(date, Name, returns) %>%
    spread(Name, returns)
df_usd <- fmxdat::USD_Strength %>%
    spread(Name, returns)
# Join
data <- data_raw %>%
    left_join(., df_cmdty, by = "date") %>%
    left_join(., df_usd, by = "date")
# Handling (i.e. dropping in our case) missing data
data <- data %>%
    filter(!is.na(Returns)) %>%
    filter(!is.na(BBDXY))
# Build a quick check to see if there are any NAs in a
# column:
NA_Checker_Function <- function(x) {
    sum(is.na(x))
}
CheckforNA <- data %>%
    apply(., 2, NA_Checker_Function)
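For reference, base R gives the same named vector of NA counts in one line:

colSums(is.na(data))   # equivalent to the apply() call above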
Before we begin modeling, we need to take a closer look at the sample. The minimum sample size across dates is 50 funds:
# Plot number of funds by date
data %>%
    group_by(date) %>%
    summarise(`# of funds` = n()) %>%
    ggplot(aes(date, `# of funds`)) + geom_line() + theme_bw()
… however, some funds have very short histories.
# Plot number of months by fund
data %>%
    group_by(Funds) %>%
    summarise(`# of dates` = n()) %>%
    arrange(`# of dates`) %>%
    head(20)
## # A tibble: 20 × 2
## Funds `# of dates`
## <chr> <int>
## 1 Fund_117 2
## 2 Fund_184 2
## 3 Fund_28 7
## 4 Fund_120 8
## 5 Fund_128 8
## 6 Fund_94 8
## 7 Fund_154 9
## 8 Fund_169 9
## 9 Fund_103 10
## 10 Fund_140 10
## 11 Fund_6 10
## 12 Fund_7 10
## 13 Fund_8 10
## 14 Fund_190 11
## 15 Fund_189 12
## 16 Fund_63 12
## 17 Fund_178 13
## 18 Fund_191 13
## 19 Fund_199 13
## 20 Fund_225 13
We will drop any funds with a sample size of less than 4 years (48 months), as well as funds that are no longer active. tidyfit does include many regularized regression techniques that can handle very small sample sizes, but this will simplify things a little later on:
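The filtering chunk is not echoed; a minimal sketch, assuming a fund counts as active if it is observed on the final date in the sample, could look like this:

# Assumption: 'active' = the fund has an observation on the last sample date
last_date <- max(data$date)
data <- data %>%
    group_by(Funds) %>%
    filter(n() >= 48, max(date) == last_date) %>%   # at least 48 months and still active
    ungroup()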
This leaves us with a data set of exactly 100 Funds. Now we are ready to get modeling!
tidyfit has a number of nice features. Have a look at the repo and the website for more complete documentation. One feature is that we can fit a regression on grouped tibbles; in this case a regression is fitted for each group in the data. The syntax is extremely simple:
models <- data %>%
    group_by(Funds) %>% # So we estimate a model for each fund
    regress(Returns ~ ., m("lm"), .mask = "date")
models
## # A tibble: 100 × 7
## Funds model estimator_fct `size (MB)` grid_id model_object settings
## <chr> <chr> <chr> <dbl> <chr> <list> <list>
## 1 Fund_1 lm stats::lm 0.263 #0010000 <tidyFit> <tibble>
## 2 Fund_10 lm stats::lm 0.290 #0010000 <tidyFit> <tibble>
## 3 Fund_101 lm stats::lm 0.261 #0010000 <tidyFit> <tibble>
## 4 Fund_107 lm stats::lm 0.290 #0010000 <tidyFit> <tibble>
## 5 Fund_108 lm stats::lm 0.290 #0010000 <tidyFit> <tibble>
## 6 Fund_109 lm stats::lm 0.249 #0010000 <tidyFit> <tibble>
## 7 Fund_111 lm stats::lm 0.288 #0010000 <tidyFit> <tibble>
## 8 Fund_112 lm stats::lm 0.290 #0010000 <tidyFit> <tibble>
## 9 Fund_114 lm stats::lm 0.250 #0010000 <tidyFit> <tibble>
## 10 Fund_115 lm stats::lm 0.270 #0010000 <tidyFit> <tibble>
## # … with 90 more rows
regress takes (1) a data argument (piped in this case), (2) a formula (a regression of Returns on everything else), and (3) a model wrapper m(). There are also some additional arguments like .mask that help us set things up. The model wrapper stipulates the method used and can take any additional arguments to pass to that method. For instance, here we are fitting OLS regressions using lm(). To see what methods you could use here, check ?m(). It is important to note, however, that you need to have some familiarity with the underlying package in order to know what types of arguments you can pass. To see what the underlying method is, and for more details about hyperparameters etc., see ?.model.<method> (e.g. ?.model.lm).
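As an illustration of passing method-specific arguments through m(): in the backward-selection call used further below, method and IC are arguments of bestglm::bestglm and are handed straight through to that function:

# regress(data, Returns ~ ., m("subset", method = "backward", IC = "AIC"), .mask = "date")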
The models object contains fitted models for each fund. In our case there should be a total of 100 fund-specific models in the model frame. We are interested in the loadings, which are accessible with coef:
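The chunk producing the table below was not echoed; it is simply the coefficient extraction, assigned to the object used in the plots that follow:

factor_loadings <- coef(models)
factor_loadings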
## # A tibble: 1,200 × 5
## # Groups: Funds, model [100]
## Funds model term estimate model_info
## <chr> <chr> <chr> <dbl> <list>
## 1 Fund_1 lm (Intercept) 0.00175 <tibble [1 × 3]>
## 2 Fund_1 lm Capped SWIX 0.372 <tibble [1 × 3]>
## 3 Fund_1 lm Yield -0.119 <tibble [1 × 3]>
## 4 Fund_1 lm Value -0.112 <tibble [1 × 3]>
## 5 Fund_1 lm Momentum 0.118 <tibble [1 × 3]>
## 6 Fund_1 lm Quality 0.0202 <tibble [1 × 3]>
## 7 Fund_1 lm Size 0.404 <tibble [1 × 3]>
## 8 Fund_1 lm Bcom_Index -0.0297 <tibble [1 × 3]>
## 9 Fund_1 lm Gold 0.0770 <tibble [1 × 3]>
## 10 Fund_1 lm Oil_Brent 0.0224 <tibble [1 × 3]>
## # … with 1,190 more rows
Suppose we want to identify some value funds with commodities exposure. Here’s a plot — we’re looking for the funds with positive value loadings and a positive coefficient on the Bcom_Index:
factor_loadings %>%
    select(Funds, term, estimate) %>%
    spread(term, estimate) %>%
    ggplot(aes(Value, Bcom_Index)) + geom_point() + geom_hline(yintercept = 0) +
    geom_vline(xintercept = 0) + geom_rect(aes(xmin = 0, xmax = max(Value) +
        0.02, ymin = 0, ymax = max(Bcom_Index) + 0.02), color = "red",
        alpha = 0) + theme_bw()
A problem is that linear regression is often spurious or imprecise (particularly when T is small). Let’s apply Occam’s Razor here and select sparse models using a backward selection algorithm (i.e. if a factor does not improve the model, it is dropped).
To perform a backward selection, we use the “subset” method in m(). This method can do forward selection, backward selection and exhaustive search. It wraps bestglm, which builds on the R standard for these algorithms, leaps, but with some additional helpful extensions. Here is where the power of tidyfit kicks in: only a slight adjustment to the code allows us to generate the new models:
models_subset <- data %>%
    group_by(Funds) %>%
    regress(Returns ~ ., m("subset", method = "backward", IC = "AIC"),
        .mask = "date")
models_subset
## # A tibble: 100 × 8
## Funds model estimator_fct size (M…¹ grid_id model_o…² settings warni…³
## <chr> <chr> <chr> <dbl> <chr> <list> <list> <chr>
## 1 Fund_1 subset bestglm::bestglm 0.0533 #00100… <tidyFit> <tibble> <NA>
## 2 Fund_10 subset bestglm::bestglm 0.0814 #00100… <tidyFit> <tibble> <NA>
## 3 Fund_101 subset bestglm::bestglm 0.0486 #00100… <tidyFit> <tibble> <NA>
## 4 Fund_107 subset bestglm::bestglm 0.0903 #00100… <tidyFit> <tibble> model …
## 5 Fund_108 subset bestglm::bestglm 0.0903 #00100… <tidyFit> <tibble> <NA>
## 6 Fund_109 subset bestglm::bestglm 0.0491 #00100… <tidyFit> <tibble> <NA>
## 7 Fund_111 subset bestglm::bestglm 0.0839 #00100… <tidyFit> <tibble> model …
## 8 Fund_112 subset bestglm::bestglm 0.0815 #00100… <tidyFit> <tibble> <NA>
## 9 Fund_114 subset bestglm::bestglm 0.0423 #00100… <tidyFit> <tibble> <NA>
## 10 Fund_115 subset bestglm::bestglm 0.0628 #00100… <tidyFit> <tibble> <NA>
## # … with 90 more rows, and abbreviated variable names ¹`size (MB)`,
## # ²model_object, ³warnings
Once again we fetch the coefficients and plot:
factor_loadings_subset <- coef(models_subset)
factor_loadings_subset %>%
    select(Funds, term, estimate) %>%
    spread(term, estimate) %>%
    drop_na(Value, Bcom_Index) %>%
    ggplot(aes(Value, Bcom_Index)) + geom_point() + geom_hline(yintercept = 0) +
    geom_vline(xintercept = 0) + geom_rect(aes(xmin = 0, xmax = max(Value) +
        0.02, ymin = 0, ymax = max(Bcom_Index) + 0.02), color = "red",
        alpha = 0) + theme_bw()
Lo and behold, we have only one fund in our quadrant of interest, and very few with robust loadings on value and commodities. Let’s see which fund this is:
factor_loadings_subset %>%
    select(Funds, term, estimate) %>%
    spread(term, estimate) %>%
    filter(Value > 0, Bcom_Index > 0)
## # A tibble: 1 × 14
## # Groups: Funds, model [1]
## model Funds (Intercep…¹ BBDXY Bcom_…² Cappe…³ Gold Momen…⁴ Oil_B…⁵ Plat
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 subset Fund_50 -0.0000791 NA 0.0641 0.774 -0.0359 NA NA NA
## # … with 4 more variables: Quality <dbl>, Size <dbl>, Value <dbl>, Yield <dbl>,
## # and abbreviated variable names ¹`(Intercept)`, ²Bcom_Index, ³`Capped SWIX`,
## # ⁴Momentum, ⁵Oil_Brent
Now we may be interested in exploring style drift in this fund. Let’s estimate the loadings for this specific fund in a time-varying parameter framework. Again, tidyfit helps us out here:
# Notice how we no longer need .mask='date', because we are
# passing the arg index_col='date'. This ensures that for
# time-varying parameter estimation, the result of coef()
# contains a date index.
if (!require("shrinkTVP")) install.packages("shrinkTVP")
models_tvp <- data %>%
    filter(Funds == "Fund_50") %>%
    regress(Returns ~ Value + `Capped SWIX` + Bcom_Index, m("tvp",
        niter = 5000, index_col = "date"))
We can fetch and plot the coefficients over time:
factor_loadings_tvp <- coef(models_tvp)
factor_loadings_tvp <- factor_loadings_tvp %>%
    unnest(model_info)
factor_loadings_tvp
## # A tibble: 604 × 7
## # Groups: model [1]
## model term estimate upper lower posterior.sd index
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <date>
## 1 tvp (Intercept) -0.000140 0.000639 -0.00131 0.000718 2005-01-31
## 2 tvp (Intercept) -0.000150 0.000644 -0.00138 0.000742 2005-02-28
## 3 tvp (Intercept) -0.000161 0.000675 -0.00146 0.000773 2005-03-31
## 4 tvp (Intercept) -0.000165 0.000670 -0.00148 0.000772 2005-05-31
## 5 tvp (Intercept) -0.000168 0.000669 -0.00150 0.000767 2005-06-30
## 6 tvp (Intercept) -0.000171 0.000661 -0.00152 0.000777 2005-08-31
## 7 tvp (Intercept) -0.000174 0.000694 -0.00151 0.000793 2005-09-30
## 8 tvp (Intercept) -0.000171 0.000676 -0.00150 0.000792 2005-10-31
## 9 tvp (Intercept) -0.000165 0.000706 -0.00150 0.000779 2005-11-30
## 10 tvp (Intercept) -0.000164 0.000697 -0.00155 0.000768 2006-01-31
## # … with 594 more rows
factor_loadings_tvp %>%
    ggplot(aes(index, estimate)) + geom_ribbon(aes(ymax = upper,
        ymin = lower), alpha = 0.25) + geom_line() + facet_wrap("term",
        scales = "free") + theme_bw()
The value loading has been steadily increasing over the period of the fund’s existence. Similarly, the commodities loading reflects an upward trend. The credible interval (obtained by taking the 90% quantile interval of the beta posteriors) is very wide.
Now, we may ask whether the backward selection is actually any good at predicting returns. Let’s use a predictive workflow to assess the out-of-sample performance of the models. Here I will introduce two additional methods into the mix. The first is a LASSO regression, which is estimated with glmnet. The LASSO also estimates sparse models, but unlike backward selection it is an embedded method (i.e. it estimates parameters and performs variable selection simultaneously, not sequentially like the subset selection algorithms). The second method is a Bayesian regression using arm.
The LASSO regression requires hyperparameter tuning. A reasonable grid for the penalty parameter is chosen automatically (but could also be passed manually using m("lasso", lambda = <grid vector>)). However, we need to define a .cv argument to tell regress how to cross-validate the grid. tidyfit leverages the rsample package to build cross-validation samples. Here we will use a simple train/validate split with time-ordered data: .cv = "initial_time_split".
# IMPORTANT: Chunk takes a little longer to run
if (!require("bestglm")) install.packages("bestglm")
if (!require("arm")) install.packages("arm")
models <- data %>%
    filter(year(date) < 2021) %>%
    regress(Returns ~ ., m("lm"), m("subset", method = "backward",
        IC = "AIC"), m("lasso"), m("bayes"), .mask = "date",
        .cv = "initial_time_split")
Next we generate out-of-sample predictions. Here we can simply use predict:
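The prediction chunk is not echoed here; a sketch, assuming the held-out observations are those from 2021 onwards (the complement of the training filter above), would be:

# Predict on the post-2020 data that was excluded from estimation
predictions <- models %>%
    predict(data %>% filter(year(date) >= 2021))
predictions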
Now, calculating the accuracy is made very easy using the yardstick package. It is important to note that the predictions tibble is already grouped appropriately. If this were not the case, you would still need to group by model and Funds.
if (!require("yardstick")) install.packages("yardstick")
acc <- predictions %>%
yardstick::rmse(truth, prediction)
Let’s plot the result. To make it easier to compare, I will present all metrics relative to “lm”. So a positive value is a larger error, and a negative value is a smaller error:
acc %>%
    mutate(.estimate = .estimate - .estimate[model == "lm"]) %>%
    filter(model != "lm") %>%
    ggplot(aes(model, .estimate)) + geom_hline(yintercept = 0,
        color = "red") + geom_boxplot() + theme_bw()
Interesting result: backward selection does not actually improve things at all. The Bayesian regression does a better job than OLS and the LASSO certainly does as well. I have seen this quite often: LASSO and other embedded methods are much better than sequential methods at variable selection.
We’ve gone through quite a lot of analysis in a very short time. Now it is your turn to get familiar with tidyfit and try out some of your own workflows. Here are some ideas:
m("quantile")
)classify
m("glmm")
). This requires some knowledge of
lme4
.