Linear regression modeling with high-dimensional or multicollinear data sets is prone to overfitting. As a general rule, the more parameters a model uses to predict its target, the higher the risk of overfitting. More parameters can take the form either of more features in your data set or of more *internal* model flexibility if you are working with nonlinear methods.

The effective number of parameters in your model is called the model’s effective degrees of freedom, which I will denote \(\nu\).^{1} For an OLS regression with an intercept, \(\nu = k+1\), where \(k\) is the number of features in the data set. For a nonlinear algorithm like a random forest, \(\nu\) can be significantly higher than the number of features.^{2} Conversely, for a regularized linear regression, \(\nu\) can be much smaller than \(k\) without reducing the actual number of columns in the data set. One key to robust out-of-sample performance is to **capture as much signal as possible with as small a** \(\nu\) **as possible**.
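To make \(\nu\) concrete, here is a minimal sketch that computes it as the trace of the hat matrix, a common definition for linear smoothers. It assumes a ridge penalty with an unpenalized intercept as the regularized example; the function name `effective_df` and the simulated data are purely illustrative.

```python
import numpy as np

def effective_df(X, lam=0.0):
    """Trace of the hat matrix Z (Z'Z + lam * P)^{-1} Z'.

    Z is X with an intercept column prepended; P is the identity with a
    zero in the intercept position, so the intercept is not penalized.
    lam = 0 corresponds to OLS.
    """
    n, k = X.shape
    Z = np.column_stack([np.ones(n), X])
    P = np.eye(k + 1)
    P[0, 0] = 0.0
    A = Z.T @ Z + lam * P
    H = Z @ np.linalg.solve(A, Z.T)
    return float(np.trace(H))

rng = np.random.default_rng(0)
n, k = 200, 10
X = rng.standard_normal((n, k))  # simulated features (assumption)

df_ols = effective_df(X, lam=0.0)      # equals k + 1 up to rounding
df_ridge = effective_df(X, lam=500.0)  # between 1 and k + 1

print(f"OLS:   nu = {df_ols:.2f}")
print(f"Ridge: nu = {df_ridge:.2f}")
```

The ridge fit still uses all ten columns, yet its \(\nu\) is well below \(k+1\), which is the sense in which regularization reduces degrees of freedom without dropping features.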

There are many different approaches to reducing a model's degrees of freedom (this is arguably a large part of what the disciplines of machine learning and Bayesian analysis are all about). In the linear regression context these include subset selection, \(\ell_1\)/\(\ell_2\)-norm penalties, Bayesian regression, dimension reduction, and several more. Penalized regressions like the lasso (Tibshirani 1996) or the elastic net (Zou and Hastie 2005) are particularly useful since they perform simultaneous model selection and fitting. Penalties are typically imposed on the parameter norm, meaning parameter values are shrunk towards zero.
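As a small illustration of shrinkage towards zero, the sketch below fits a ridge regression (an \(\ell_2\)-norm penalty) in closed form for increasing penalty strengths and reports the norm of the coefficient vector. The data, the true coefficients, and the helper name `ridge_coefs` are assumptions made for the example.

```python
import numpy as np

def ridge_coefs(X, y, lam):
    """Closed-form ridge estimate (X'X + lam * I)^{-1} X'y, no intercept."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 1.5])  # illustrative values
y = X @ beta_true + rng.standard_normal(100)

lambdas = [0.0, 10.0, 100.0, 1000.0]
norms = [float(np.linalg.norm(ridge_coefs(X, y, lam))) for lam in lambdas]

for lam, nrm in zip(lambdas, norms):
    print(f"lambda = {lam:7.1f}   ||beta||_2 = {nrm:.3f}")
```

The coefficient norm falls as the penalty grows, so every parameter is pulled towards the same target of zero regardless of how the features relate to one another.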

In this recently published paper, I introduce the **Hierarchical Feature Regression** (HFR), which takes a somewhat different approach (Pfitzinger 2024). Instead of shrinking parameters towards zero, it shrinks them towards group target values. The idea is simple: if two features have a similar effect on the target, their parameters should be similar. This type of grouping is highly intuitive and can lead to robust out-of-sample predictions.^{3}
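The difference between the two shrinkage targets can be sketched with a toy example. Note that this is not the HFR estimator: it assumes the group is known in advance (two correlated features) and uses a simple fusion-style penalty \(\lambda(\beta_1 - \beta_2)^2\) as a stand-in for shrinking towards a group value. All names and data here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
z = rng.standard_normal(n)
# Two highly correlated features with the same true effect on y (assumption).
x1 = z + 0.3 * rng.standard_normal(n)
x2 = z + 0.3 * rng.standard_normal(n)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.standard_normal(n)

def penalized_coefs(X, y, lam, D):
    """Solve min ||y - X b||^2 + lam * ||D b||^2 in closed form."""
    return np.linalg.solve(X.T @ X + lam * D.T @ D, X.T @ y)

lam = 1e6
D_zero = np.eye(2)                 # ridge: penalizes each coefficient
D_group = np.array([[1.0, -1.0]])  # fusion: penalizes only b1 - b2

beta_ridge = penalized_coefs(X, y, lam, D_zero)
beta_group = penalized_coefs(X, y, lam, D_group)

print("shrunk towards zero:       ", np.round(beta_ridge, 4))
print("shrunk towards group value:", np.round(beta_group, 4))
```

Under a heavy penalty the ridge coefficients collapse towards zero and the signal is lost, while the group penalty only forces the two coefficients to agree and leaves their common level to be estimated from the data.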