Evolving Themescapes: Powerful Auto-ML for Thematic Investment with tidyfit

Recent years have been marked by an unusual amount of geopolitical upheaval and crisis. In this post, I explore how this period has changed the importance of different investment themes. Which trends have grown in importance? What can be discovered about evolving market priorities and the brave new world ahead?

To explore these questions, I draw on a data set of MSCI Thematic and Sector index returns, and calculate the regression-based importance of each theme for each sector over time. The analytical workflow is typical of the quantitative finance setting, essentially requiring the estimation of a large number of linear regressions that provide orthogonal exposures to different investment themes. Here the R package tidyfit (available on CRAN) can be extremely helpful, since it automates much of the machine learning pipeline for regularized regressions (Pfitzinger 2022).

MSCI provides thematic equity indexes for 17 different themes that range from digital health and cybersecurity to millennials and future education. The following plot shows the average change in each theme’s importance, measured as the change in the absolute standardized beta, from before the COVID-19 pandemic to after it. The regression betas are estimated using an elastic net regression (discussed below). A positive value suggests that the theme has, on average, increased in importance in recent years:

The plot shows some clear winners and losers: Digital Health and Ageing Society Opportunities have gained significant traction, while Millennials and Future Mobility have lost importance across a majority of sectors. For many themes the result is not clear-cut when examined across all sectors. Before drilling into these results and exploring in which sectors the various themes have gained or waned, let’s examine how we got here and, importantly, how tidyfit makes it possible to generate the above result with as little as three lines of code.

Auto-ML with tidyfit

The data set consists of monthly returns for 10 ACWI¹ sector indexes, which are regressed on 17 ACWI thematic indexes and the MSCI ACWI market index (see here). Here is a snapshot of the data:

## # A tibble: 910 × 21
## # Groups:   Sector [13]
##    Date       Sector   Return ACCELE…¹    ACWI AGEIN…² AUTON…³ CYBERS…⁴ DIGITA…⁵
##    <date>     <chr>     <dbl>    <dbl>   <dbl>   <dbl>   <dbl>    <dbl>    <dbl>
##  1 2016-12-30 CYCLIC… 0.0178  -5.94e-4 0.0215  0.0166   0.0126 -0.0202  -0.00391
##  2 2017-01-31 CYCLIC… 0.0371   6.84e-2 0.0271  0.0321   0.0552  0.0745   0.0686 
##  3 2017-02-28 CYCLIC… 0.0272   4.08e-2 0.0276  0.0425   0.0337  0.0141   0.0361 
##  4 2017-03-31 CYCLIC… 0.0149   2.35e-2 0.0118  0.00864  0.0270  0.0223   0.0328 
##  5 2017-04-28 CYCLIC… 0.0202   2.99e-2 0.0161  0.0249   0.0152  0.0256   0.0360 
##  6 2017-05-31 CYCLIC… 0.0204   3.17e-2 0.0200  0.0258   0.0466  0.0538   0.0608 
##  7 2017-06-30 CYCLIC… 0.00922  1.47e-2 0.00590 0.0247  -0.0108 -0.00495 -0.00522
##  8 2017-07-31 CYCLIC… 0.0339   2.27e-2 0.0274  0.0220   0.0330 -0.0111   0.0652 
##  9 2017-08-31 CYCLIC… 0.00538  3.53e-2 0.00354 0.00425  0.0236  0.0121   0.0280 
## 10 2017-09-29 CYCLIC… 0.0214   9.20e-3 0.0214  0.00299  0.0296  0.0367   0.0118 
## # … with 900 more rows, 12 more variables: `DIGITAL HEALTH` <dbl>,
## #   `SMART CITIES` <dbl>, and abbreviated variable names

The tidyfit package provides a wrapper for a large number of machine learning and statistical regression techniques. Given the short history of the thematic index data (beginning in December 2016), an OLS estimate of the thematic loadings is likely to be imprecise and misleading. To reduce the variance of the estimate, I instead use a regularized regression approach: the elastic net, which permits sparsity in the loading vector. Specifically, I compare thematic loadings before and after the onset of the COVID-19 pandemic, using the regression below (where \(r_t\) denotes monthly returns and \(\tilde{r}_{\text{sector},t}\) is the sector return in excess of the market index): \[ \tilde{r}_{\text{sector},t} \sim \beta_0 + \sum_{i \in \text{themes}}\beta_i r_{i,t} + \varepsilon_t,\;\;\;\tilde{r}_{\text{sector},t} = r_{\text{sector},t} - r_{\text{acwi},t} \] The periods before and after the pandemic are obtained by creating an indicator variable labelled Period:

library(dplyr)

# Split data into before and after periods
df <- df %>% 
  mutate(Period = ifelse(Date < as.Date("2020-03-01"), "before", "after")) %>% 
  # Calculate active sector returns
  mutate(Return = Return - ACWI)

Now I fit elastic net regressions for each sector and for each period, using leave-one-out cross validation to determine optimal hyperparameter values. tidyfit makes this analysis extremely simple, with a single line of code to optimize all the models:

library(tidyfit)

# Chunk can take long to execute - enable progress bar
progressr::handlers(global = TRUE)
# Enable parallel computation
future::plan(future::multisession, workers = 8)
# Normalize the data and fit the model
fit <- df %>%
  # Estimate regressions for each sector and period
  group_by(Sector, Period) %>% 
  # Normalize data to obtain standardized coefficients
  mutate_at(vars(-Date, -Sector, -Period), BBmisc::normalize) %>% 
  # Fit models
  regress(Return ~ ., ElasticNet = m("enet"), .cv = "loo_cv", .mask = "Date")

There are a few things to unpack here. Given appropriate grouping variables, tidyfit automatically fits separate regressions for each sector and period using glmnet (Simon et al. 2011). Methods are passed as name-function pairs (here ElasticNet = m("enet"); see ?m for an overview of methods). The .cv and .cv_args arguments set up the cross validation, which simply uses rsample::loo_cv in the background (Silge et al. 2022). To obtain a single optimal hyperparameter setting across all groups, which could be argued to improve comparability, set .tune_each_group = FALSE, so that the algorithm is tuned for the entire system rather than for each group. The final argument .mask = "Date" ensures that the date column is not treated as a regressor.
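
For reference, the hyperparameters in question are those of the elastic net penalty. As a sketch in the parameterization documented for glmnet (the symbols \(\lambda\), \(\alpha\) and \(T\) are introduced here for illustration and are not part of the notation above), the estimate for each sector and period solves \[ \min_{\beta_0,\,\beta}\;\frac{1}{2T}\sum_{t=1}^{T}\Big(\tilde{r}_{\text{sector},t} - \beta_0 - \sum_{i \in \text{themes}}\beta_i r_{i,t}\Big)^2 + \lambda\Big[\alpha\sum_{i \in \text{themes}}|\beta_i| + \frac{1-\alpha}{2}\sum_{i \in \text{themes}}\beta_i^2\Big] \] where \(T\) is the number of months in the group, \(\lambda \geq 0\) sets the overall strength of the penalty and \(\alpha \in [0,1]\) mixes the lasso term (\(\alpha = 1\)) with the ridge term (\(\alpha = 0\)). The absolute-value term is what allows individual loadings to be set exactly to zero. With leave-one-out cross validation, each candidate setting is scored by its prediction error on the single held-out month, averaged over all months.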

The coefficients are now obtained using standard generics:

coefs <- coef(fit)
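
To make the importance measure concrete, the following is a minimal sketch of how the average change in the absolute standardized beta could be computed from a coefficient table. It uses a small hand-made data frame rather than the actual output of coef(fit); the column names (Sector, Period, term, estimate) and all values are assumptions for illustration only, and base R is used so that the chunk runs without additional packages.

```r
# Toy stand-in for a coefficient table (hypothetical columns and values)
coefs_toy <- data.frame(
  Sector   = rep(c("ENERGY", "FINANCIALS"), each = 4),
  Period   = rep(c("before", "after"), times = 4),
  term     = rep(rep(c("DIGITAL HEALTH", "MILLENNIALS"), each = 2), times = 2),
  estimate = c(0.10, 0.30, -0.40, -0.10, 0.00, 0.20, 0.50, 0.25)
)

# Drop the intercept (if present) and take absolute standardized betas
coefs_toy <- coefs_toy[coefs_toy$term != "(Intercept)", ]
coefs_toy$abs_beta <- abs(coefs_toy$estimate)

# Put the two periods side by side for each sector-theme pair
keep   <- c("Sector", "term", "abs_beta")
before <- coefs_toy[coefs_toy$Period == "before", keep]
after  <- coefs_toy[coefs_toy$Period == "after", keep]
change <- merge(before, after, by = c("Sector", "term"),
                suffixes = c("_before", "_after"))

# Change in importance per sector and theme, then averaged across sectors
change$delta <- change$abs_beta_after - change$abs_beta_before
avg_change   <- aggregate(delta ~ term, data = change, FUN = mean)
print(avg_change)
```

In this toy example the average change is 0.2 for DIGITAL HEALTH and -0.275 for MILLENNIALS. Applied to a real coefficient table, the intermediate change table holds one value per sector and theme, while avg_change holds one value per theme.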

Exploring sector-level exposures

The plot at the beginning of the article suggests substantial variation in exposure shifts across sectors. The change in the absolute exposure of each sector to each theme is visualized using a theme-by-sector heatmap. Themes that have declined in importance for a sector are indicated by red hues, while blue hues indicate an increase in the theme’s importance.

Examine, for instance, the case of Digital Economy — a particularly interesting theme in the post-pandemic world. The theme has grown in importance overall, but particularly in the IT and Communication Services sectors. Other interesting developments are the decline in importance of the Future Mobility theme for the Materials sector, in favor of (among others) Food Revolution and Accelerating Change, or the rotation in the Millennials theme away from consumer goods.

Evolving thematic exposures

Comparing periods before and after the pandemic, as I do above, may be somewhat crude and gloss over changes occurring in the intervening years. To explore this aspect, one could estimate thematic exposures that evolve over time in a time-varying parameter framework, with \[ \tilde{r}_{\text{sector},t} \sim \beta_{t,0} + \sum_{i \in \text{themes}}\beta_{t,i}r_{i,t} + \varepsilon_t,\;\;\;\tilde{r}_{\text{sector},t} = r_{\text{sector},t} - r_{\text{acwi},t} \] tidyfit uses shrinkTVP to estimate time-varying parameter models (Knaus et al. 2021). The method automatically selects between constant, time-varying and sparse coefficients, providing a flexible and strongly regularized Bayesian framework with which to explore evolving thematic exposures. The code chunk below replaces m("enet") with m("tvp"). Note that we no longer need to group by Period (the column is instead excluded via .mask), and pass index_col = "Date" to indicate that the Date column should be used as an index. The additional arguments are passed to shrinkTVP::shrinkTVP and specify a hierarchical Bayesian lasso regression (see ?shrinkTVP):

fit_tvp <- df %>%
  # Estimate regressions for each sector
  group_by(Sector) %>% 
  # Fit models
  regress(Return ~ 0+., 
          TVP = m("tvp", learn_a_xi = FALSE, learn_a_tau = FALSE, a_xi = 1, a_tau = 1, index_col = "Date"), 
          .mask = "Period")

coefs_tvp <- coef(fit_tvp)
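
As a sketch of what "time-varying" means here, the model described in Knaus et al. (2021) lets each coefficient follow a random walk (the symbols \(w_{t,i}\) and \(\theta_i\) are notation introduced for this illustration): \[ \beta_{t,i} = \beta_{t-1,i} + w_{t,i},\;\;\;w_{t,i} \sim \mathcal{N}(0, \theta_i) \] Shrinkage priors are placed both on the innovation variance \(\theta_i\) and on the mean of the coefficient's initial value. When \(\theta_i\) is shrunk towards zero, the coefficient is effectively constant over time; when the initial mean is shrunk to zero as well, the theme effectively drops out of the regression. This is the sense in which the approach chooses between constant, time-varying and sparse coefficients.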

Rather than exploring the full wealth of trends and insights captured by the above regression, I sample the results here using two examples. The left-hand plot shows the increase in exposure to the Digital Economy theme for the Communication Services sector. The Bayesian estimates reflect a high degree of uncertainty, which is unsurprising given the flexible parameterization. The right-hand plot shows an interesting change in the exposure trend, with the importance of the Robotics theme initially growing for the Industrials sector, and subsequently declining from mid-2020:

Final thoughts

While the results have been very easy to generate with the powerful tidyfit framework, the complexity of the problem and the limited data availability can easily lead to spurious results, so the findings should be treated with care. Nonetheless, by comparing quickly across different regression algorithms and examining the cross-sector averages, broad trends can be gleaned from the data, which suggest substantial movement in the global equity themescape in the post-pandemic world.


Knaus, Peter, Angela Bitto-Nemling, Annalisa Cadonna, and Sylvia Frühwirth-Schnatter. 2021. “Shrinkage in the Time-Varying Parameter Model Framework Using the R Package shrinkTVP.” Journal of Statistical Software 100 (13): 1–32. https://doi.org/10.18637/jss.v100.i13.
Pfitzinger, Johann. 2022. Tidyfit: Regularized Linear Modeling with Tidy Data. https://CRAN.R-project.org/package=tidyfit.
Silge, Julia, Fanny Chow, Max Kuhn, and Hadley Wickham. 2022. Rsample: General Resampling Infrastructure. https://CRAN.R-project.org/package=rsample.
Simon, Noah, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2011. “Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent.” Journal of Statistical Software 39 (5): 1–13. https://doi.org/10.18637/jss.v039.i05.

  1. This refers to the MSCI All Countries World Index.

