Introduction to the Hierarchical Feature Regression

Linear regression modeling with high-dimensional or multicollinear data sets is prone to overfitting. As a general rule, the more parameters a model uses to predict its target, the higher the risk of overfitting. More parameters can take the form of additional features in the data set or, when working with nonlinear methods, of greater internal model flexibility.

The number of parameters in your model is called the model’s effective degrees of freedom, which I will denote \(\nu\). For an OLS regression \(\nu = k+1\), where \(k\) is the number of features in the data set. For a nonlinear algorithm like a random forest, \(\nu\) can be significantly higher than the number of features. Conversely, for a regularized linear regression, \(\nu\) can be much smaller than \(k\) without reducing the actual number of columns in the data set. One key to robust out-of-sample performance is to capture as much signal as possible with as small a \(\nu\) as possible.
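To make this concrete, here is a minimal R sketch (my own illustration, not taken from the paper) that computes the effective degrees of freedom of a ridge regression as the trace of its hat matrix, showing how \(\nu\) falls well below \(k\) as the penalty grows:

```r
# Effective degrees of freedom of a ridge fit: nu(lambda) = trace(X (X'X + lambda I)^-1 X')
# Simulated data and variable names are my own assumptions.
set.seed(1)
n <- 100; k <- 10
X <- matrix(rnorm(n * k), n, k)

ridge_df <- function(X, lambda) {
  H <- X %*% solve(crossprod(X) + lambda * diag(ncol(X))) %*% t(X)
  sum(diag(H))  # trace of the hat matrix
}

ridge_df(X, lambda = 0)    # ~ k (no intercept here, so the OLS value is k rather than k + 1)
ridge_df(X, lambda = 100)  # considerably smaller than k
```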

There are many different approaches to reducing a model’s degrees of freedom (this is arguably a large part of what the disciplines of machine learning and Bayesian analysis are all about). In the linear regression context these include subset selection, \(\ell_1\)/\(\ell_2\)-norm penalties, Bayesian regression, dimension reduction, and several more. Penalized regressions like the lasso (Tibshirani 1996) or the elastic net (Zou and Hastie 2005) are particularly useful since they perform simultaneous model selection and fitting. The penalty is typically imposed on the parameter norm, meaning that parameter values are shrunk towards zero.
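For readers who want to see this shrinkage-towards-zero behaviour directly, here is a minimal sketch using the glmnet package on simulated data (the data and settings are my own, not taken from the post):

```r
# Minimal sketch of lasso and elastic net shrinkage with glmnet (simulated data)
library(glmnet)

set.seed(1)
n <- 200; k <- 50
X <- matrix(rnorm(n * k), n, k)
y <- as.numeric(X[, 1:5] %*% rep(1, 5) + rnorm(n))  # only the first 5 features carry signal

lasso_fit <- cv.glmnet(X, y, alpha = 1)    # alpha = 1: lasso penalty
enet_fit  <- cv.glmnet(X, y, alpha = 0.5)  # 0 < alpha < 1: elastic net

# Coefficients at the cross-validated lambda: most are shrunk exactly to zero
coef(lasso_fit, s = "lambda.min")
```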

In this recently published paper, I introduce the Hierarchical Feature Regression (HFR), which takes a somewhat different approach (Pfitzinger 2024). Instead of shrinking parameters towards zero, it shrinks them towards group target values. The idea is simple: if the effect of two features on the target is similar, their parameters should be similar. This type of grouping is highly intuitive and can lead to robust out-of-sample predictions.
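A minimal sketch of what this looks like in practice is given below. It assumes the hfr package on CRAN that accompanies the paper and its hfr(x, y, kappa) interface, where smaller values of kappa impose more shrinkage (my reading of the interface; check the package documentation for the exact semantics):

```r
# Minimal sketch; assumes the CRAN 'hfr' package and its hfr(x, y, kappa) interface
library(hfr)

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)   # x2 is nearly a copy of x1
x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)
y  <- x1 + x2 - 0.5 * x3 + rnorm(n)

fit <- hfr(X, y, kappa = 0.5)   # kappa < 1 shrinks the effective degrees of freedom (assumed semantics)
coef(fit)                       # coefficients of x1 and x2 are pulled towards a shared group value
```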

An Empirical Map of Feature Selection Algorithms

Feature selection (or model selection in more general terms) is a critical component of the predictive workflow, and perhaps one of its most opaque. Burnham and Anderson (2004) refer to the problem using the familiar language of a bias-variance trade-off: on the one hand, a more parsimonious model has fewer parameters and hence reduces the risk of overfitting; on the other hand, more features increase the amount of information incorporated into the fitting process. How to select the appropriate features remains a matter of some debate, with an almost unmanageable host of different algorithms to navigate.

In this analysis, I throw the proverbial kitchen sink at a macroeconomic feature selection problem: from correlation filtering to Bayesian model averaging, from lasso regression to random forest importance, from genetic algorithms to Laplacian scores. The aim is to explore relationships and (dis)agreements among a multidisciplinary array of 23 feature selection algorithms drawn from several of what Molnar (2022) calls “modeling mindsets”, and to examine the comparative robustness, breadth, and out-of-sample relevance of the selected information.

As it turns out, there are a few things to learn — particularly in the way algorithms are naturally partitioned into 4 distinct clusters. In the next section, I outline key findings.

tidyfit: Benchmarking regularized regression methods

This workflow demonstrates how tidyfit can be used to easily compare a large number of regularized regression methods in R. Using the Boston house prices data set, the analysis shows that Bayesian methods strongly outperform most alternatives.
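The core of such a benchmark is a single regress() call that fits many methods at once. The sketch below is indicative rather than verbatim from the workflow; it assumes the tidyfit method names used here ("lasso", "ridge", "enet", "bayes") and the MASS::Boston data:

```r
# Indicative sketch of a tidyfit benchmark on the Boston data (method names assumed)
library(tidyfit)

fit <- regress(MASS::Boston, medv ~ .,
               m("lasso"), m("ridge"), m("enet"), m("bayes"),
               .cv = "vfold_cv")

coef(fit)  # tidy frame of regularized coefficients, one block per method
```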

Inference in Neural Networks using an Explainable Parameter Encoder Network

A Parameter Encoder Neural Network (PENN) (Pfitzinger 2021) is an explainable machine learning technique that solves two problems associated with traditional XAI algorithms:

  1. It permits the calculation of local parameter distributions. Parameter distributions are often more interesting than feature contributions — particularly in economic and financial applications — since the parameters disentangle the effect from the observation (the contribution can roughly be defined as the demeaned product of effect and observation; a toy sketch after this list illustrates the distinction).
  2. It solves a problem of biased contributions that is inherent to many traditional XAI algorithms. Particularly in the setting where neural networks are powerful — in interactive, dependent processes — traditional XAI methods can be biased because they attribute the effect to each feature independently.
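To illustrate the distinction made in the first point, the toy example below (my own, and entirely outside the PENN machinery) uses a plain linear model: the parameters are the "effects", while the contributions mix each effect with the corresponding demeaned observation:

```r
# Toy illustration (not the PENN algorithm): parameters vs. contributions
set.seed(1)
n  <- 100
x1 <- rnorm(n, mean = 5)
x2 <- rnorm(n)
y  <- 2 * x1 - 1 * x2 + rnorm(n)

fit  <- lm(y ~ x1 + x2)
beta <- coef(fit)[-1]   # the "effects" (parameters), disentangled from the observations

# Contributions: effect times demeaned observation, one value per observation and feature
contrib <- sweep(scale(cbind(x1, x2), scale = FALSE), 2, beta, `*`)

beta           # two numbers: the parameters
head(contrib)  # n x 2 matrix: effects mixed with the (demeaned) observations
```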

At the end of the tutorial, I will have estimated highly nonlinear parameter functions for a simulated regression with three variables.

A GitHub version of the code can be found here.

tidyfit: Extending the tidyverse with AutoML

tidyfit is an R package that facilitates and automates linear regression and classification modeling in a tidy environment. The package includes several methods, such as lasso, PLS, and elastic net regressions, and can be augmented with custom methods. tidyfit builds on the tidymodels suite, but emphasizes automated modeling with a focus on linear regression and classification coefficients, which are the package's primary output.
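A minimal usage sketch looks roughly as follows; the method names come from the package's documented method list, but the exact arguments here are my assumptions:

```r
# Minimal illustrative tidyfit sketch (method names from the docs; arguments assumed)
library(tidyfit)

fit <- regress(mtcars, mpg ~ ., m("lasso"), m("pls"), m("enet"), .cv = "vfold_cv")

coef(fit)             # tidy tibble of coefficients, one block per method
predict(fit, mtcars)  # tidy tibble of fitted values
```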