Introduction to the Hierarchical Feature Regression

Linear regression modeling with high-dimensional or multicollinear data sets is prone to overfitting. As a general rule, the more parameters a model uses to predict its target, the higher the risk of overfitting. More parameters can either take the form of more features in your data set, or of more internal model flexibility if you are working with nonlinear methods.

The effective number of parameters in a model is called its effective degrees of freedom, which I will denote \(\nu\). For an OLS regression \(\nu = k + 1\), where \(k\) is the number of features in the data set. For a nonlinear algorithm like a random forest, \(\nu\) can be significantly higher than the number of features. Conversely, for a regularized linear regression, \(\nu\) can be much smaller than \(k\) without reducing the actual number of columns in the data set. One key to robust out-of-sample performance is to capture as much signal as possible with as small a \(\nu\) as possible.
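To make \(\nu\) concrete, consider ridge regression, where a standard result expresses the effective degrees of freedom through the singular values \(d_i\) of the centered feature matrix: \(\nu(\lambda) = \sum_i d_i^2 / (d_i^2 + \lambda)\). The short base-R sketch below (an illustration, not code from the paper) shows how \(\nu\) falls below \(k\) as the penalty grows:

```r
# Effective degrees of freedom of a ridge fit (standard result):
#   nu(lambda) = sum_i d_i^2 / (d_i^2 + lambda),
# where d_i are the singular values of the centered feature matrix X
# (the intercept adds one further degree of freedom).
ridge_edf <- function(X, lambda) {
  d <- svd(scale(X, center = TRUE, scale = FALSE))$d
  sum(d^2 / (d^2 + lambda))
}

set.seed(123)
X <- matrix(rnorm(100 * 10), nrow = 100)  # 100 observations, k = 10 features
sapply(c(0, 1, 10, 100), function(l) ridge_edf(X, l))
# lambda = 0 recovers nu = k = 10; larger penalties push nu well below k
```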

There are many different approaches to reducing the model degrees of freedom (this is arguably a large part of what the disciplines of machine learning and Bayesian analysis are all about). In the linear regression context, these include subset selection, \(\ell_1\)/\(\ell_2\)-norm penalties, Bayesian regression, dimension reduction, and several more. Penalized regressions like the lasso (Tibshirani 1996) or the elastic net (Zou and Hastie 2005) are particularly useful since they perform simultaneous model selection and fitting. Penalties are typically imposed on the parameter norm, meaning parameter values are shrunk towards zero.
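As a minimal, self-contained sketch of what norm-penalized fitting looks like in practice (using the glmnet package rather than any code from the paper), the lasso corresponds to alpha = 1 and the elastic net to intermediate values of alpha, with the penalty strength chosen by cross-validation:

```r
library(glmnet)

set.seed(42)
n <- 200; k <- 20
X <- matrix(rnorm(n * k), n, k)
beta <- c(rep(1, 5), rep(0, k - 5))   # only 5 of 20 features carry signal
y <- as.vector(X %*% beta + rnorm(n))

# Lasso (alpha = 1): the l1 penalty shrinks many coefficients exactly to zero
cv_lasso <- cv.glmnet(X, y, alpha = 1)
coef(cv_lasso, s = "lambda.min")

# Elastic net (alpha = 0.5): mixes l1 and l2 penalties
cv_enet <- cv.glmnet(X, y, alpha = 0.5)
coef(cv_enet, s = "lambda.min")
```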

In a recently published paper, I introduce the Hierarchical Feature Regression (HFR), which takes a somewhat different approach (Pfitzinger 2024). Instead of shrinking parameters towards zero, it shrinks them towards group target values. The idea is simple: if the effects of two features on the target are similar, their parameters should be similar. This type of grouping is highly intuitive and can lead to robust out-of-sample predictions.
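The method is implemented in the hfr package on CRAN. The sketch below is a rough illustration of the interface rather than code from the paper; in particular, the kappa argument, which (as I read the package documentation) controls how strongly coefficients are pulled towards their group targets, should be checked against ?hfr::hfr.

```r
# A rough sketch of fitting an HFR with the 'hfr' package (argument names are
# based on my reading of the package docs -- verify with ?hfr::hfr).
library(hfr)

set.seed(1)
n <- 200; k <- 10
X <- matrix(rnorm(n * k), n, k)
y <- as.vector(X %*% runif(k) + rnorm(n))

# kappa < 1 shrinks the effective degrees of freedom below k by pulling
# similar coefficients towards common group targets
fit <- hfr(X, y, kappa = 0.5)
coef(fit)
plot(fit)   # visualizes the fitted regression hierarchy
```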

An Empirical Map of Feature Selection Algorithms

Feature selection (or model selection, in more general terms) is one of the most critical, and perhaps most opaque, components of the predictive workflow. Burnham and Anderson (2004) refer to the problem using the familiar language of a bias-variance trade-off: on the one hand, a more parsimonious model has fewer parameters and hence reduces the risk of overfitting; on the other hand, more features increase the amount of information incorporated into the fitting process. How to select the appropriate features remains a matter of some debate, with an almost unmanageable host of different algorithms to navigate.

In this analysis, I throw the proverbial kitchen sink at a macroeconomic feature selection problem: the methods range from correlation filtering to Bayesian model averaging, from lasso regression to random forest importance, and from genetic algorithms to Laplacian scores. The aim is to explore relationships and (dis)agreements among a multidisciplinary array of feature selection algorithms (23 in total), drawn from several of what Molnar (2022) calls “modeling mindsets”, and to examine the comparative robustness, breadth, and out-of-sample relevance of the selected information.
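The full study is not reproduced in this excerpt, but the basic mechanics of comparing selections across methods can be sketched in a few lines of R (a toy illustration using generic packages, not the post's code): fit several selectors and measure the pairwise agreement of the selected feature sets.

```r
library(glmnet)
library(randomForest)

set.seed(7)
n <- 300; k <- 15
X <- matrix(rnorm(n * k), n, k, dimnames = list(NULL, paste0("x", 1:k)))
y <- as.vector(X[, 1:4] %*% c(2, -1.5, 1, 0.5) + rnorm(n))

# Correlation filter: keep the 5 features with the largest |correlation|
cor_scores <- abs(cor(X, y))[, 1]
sel_cor <- names(sort(cor_scores, decreasing = TRUE))[1:5]

# Lasso: keep features with non-zero coefficients at lambda.1se
b <- as.matrix(coef(cv.glmnet(X, y), s = "lambda.1se"))
sel_lasso <- rownames(b)[b[, 1] != 0 & rownames(b) != "(Intercept)"]

# Random forest: keep the 5 features with the highest importance
imp <- randomForest(X, y)$importance[, 1]
sel_rf <- names(sort(imp, decreasing = TRUE))[1:5]

# Pairwise agreement of the selected feature sets (Jaccard similarity)
selections <- list(correlation = sel_cor, lasso = sel_lasso, rf = sel_rf)
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
sapply(selections, function(a) sapply(selections, function(b) jaccard(a, b)))
```

Agreement measures of this kind are one natural input to the clustering of algorithms described next.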

As it turns out, there are a few things to learn, particularly in the way the algorithms naturally partition into four distinct clusters. In the next section, I outline the key findings.

tidyfit: Benchmarking regularized regression methods

This workflow demonstrates how tidyfit can be used to easily compare a large number of regularized regression methods in R. Using the Boston house prices data set, the analysis shows how Bayesian methods strongly outperform most alternatives.
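The benchmark itself is not reproduced in this excerpt; the following is a hedged sketch of how such a comparison can be set up with tidyfit (the method names passed to m() are taken from my reading of the package documentation and should be verified there):

```r
library(dplyr)
library(tidyfit)
library(MASS)   # Boston house prices data

# Fit several regularized regressions with a common cross-validation setup.
fits <- Boston %>%
  regress(medv ~ .,
          m("lasso"), m("ridge"), m("enet"), m("bayes"),
          .cv = "vfold_cv")

coef(fits)   # tidy data frame of estimates for each method
```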

Evolving Themescapes: Powerful Auto-ML for Thematic Investment with tidyfit

Recent years have been marked by an unusual amount of geopolitical upheaval and crisis. In this post, I explore the change in importance that this period has elicited in different investment themes. Which trends have grown in importance? What can be discovered about evolving market priorities and the brave new world ahead?

To explore these questions, I draw on a data set of MSCI Thematic and Sector index returns, and calculate the regression-based importance of each theme for each sector over time. The analytical workflow is typical of the quantitative finance setting, essentially requiring the estimation of a large number of linear regressions that provide orthogonal exposures to different investment themes. Here the R package tidyfit (available on CRAN) can be extremely helpful, since it automates much of the machine learning pipeline for regularized regressions (Pfitzinger 2022).

MSCI provides thematic equity indexes for 17 different themes that range from digital health and cybersecurity to millennials and future education. The following plot shows the average change in each theme’s importance, measured as the change in the absolute standardized beta, from before the COVID-19 pandemic to after the pandemic. The regression betas are estimated using an elastic net regression (discussed below). A positive value suggests that the theme has, on average, increased in importance in recent years.
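The underlying index data cannot be shared here, but the grouped estimation pattern is straightforward with tidyfit. The sketch below uses hypothetical column names (sector, sector_return, and one column per theme) and assumes the m("enet") method name from the package documentation:

```r
library(dplyr)
library(tidyfit)

# 'theme_df' is a hypothetical data frame with one row per period, a 'sector'
# identifier, a 'sector_return' target, and one column of returns per MSCI theme.
betas <- theme_df %>%
  group_by(sector) %>%
  regress(sector_return ~ ., m("enet"), .cv = "vfold_cv") %>%
  coef()

head(betas)   # one tidy row per sector-theme coefficient
```

Standardizing the features before fitting (so that the coefficients are comparable across themes) and averaging the absolute betas before and after the pandemic then gives the change in importance described above.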

Inference in Neural Networks using an Explainable Parameter Encoder Network

A Parameter Encoder Neural Network (PENN) (Pfitzinger 2021) is an explainable machine learning technique that solves two problems associated with traditional XAI algorithms:

  1. It permits the calculation of local parameter distributions. Parameter distributions are often more interesting than feature contributions, particularly in economic and financial applications, since the parameters disentangle the effect from the observation (the contribution can roughly be defined as the demeaned product of effect and observation; see the short sketch after this list).
  2. It solves a problem of biased contributions that is inherent to many traditional XAI algorithms. Particularly in the settings where neural networks are most powerful, namely interactive, dependent processes, traditional XAI methods can be biased because they attribute effects to each feature independently.
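To illustrate the distinction drawn in the first point, the following small sketch (not the PENN algorithm itself, just the definition quoted above) shows how a contribution conflates the effect with the observed feature value, whereas the parameter isolates the effect:

```r
set.seed(99)
x <- rnorm(100, mean = 5)
beta <- 2                            # the (local) parameter, i.e. the effect
y <- beta * x + rnorm(100, sd = 0.1)

# Contribution: roughly the demeaned product of effect and observation
contribution <- beta * (x - mean(x))

# Two observations share the same effect (beta = 2) but can have very
# different contributions simply because the observed feature values differ.
head(cbind(x = x, contribution = contribution))
```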

At the end of the tutorial, I will have estimated highly nonlinear parameter functions for a simulated regression with three variables.

A GitHub version of the code can be found here.