Introduction to the Hierarchical Feature Regression

Linear regression modeling with high-dimensional or multicollinear data sets is prone to overfitting. As a general rule, the more parameters a model uses to predict its target, the higher the risk of overfitting. More parameters can either take the form of more features in your data set, or of more internal model flexibility if you are working with nonlinear methods.

The effective number of parameters a model uses is called its effective degrees of freedom, which I will denote \(\nu\).1 For an OLS regression \(\nu = k+1\), where \(k\) is the number of features in the data set. For a nonlinear algorithm like a random forest \(\nu\) can be significantly higher than the number of features.2 Conversely, for a regularized linear regression, \(\nu\) can be much smaller than \(k\) without reducing the actual number of columns in the data set. One key to robust out-of-sample performance is to capture as much signal as possible with as small a \(\nu\) as possible.
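
As a quick numerical illustration (my addition, not part of the original analysis), the OLS case can be verified directly: the effective degrees of freedom equal the trace of the hat matrix \(H = X(X'X)^{-1}X'\), which for a full-rank design with an intercept is exactly \(k+1\):

# Sketch: nu for OLS equals the trace of the hat matrix,
# i.e. the number of features plus one for the intercept.
set.seed(1)
n <- 100; k <- 5
X <- cbind(1, matrix(rnorm(n * k), n, k))  # design matrix with intercept
H <- X %*% solve(t(X) %*% X) %*% t(X)      # hat matrix
sum(diag(H))                               # = 6 = k + 1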

There are many different approaches to reducing the model degrees of freedom (this is arguably a large part of what the disciplines of machine learning and Bayesian analysis are all about). In the linear regression context these include subset selection, \(\ell_1\)/\(\ell_2\)-norm penalties, Bayesian regression, dimension reduction, and several more. Penalized regressions like the lasso (Tibshirani 1996) or the elastic net (Zou and Hastie 2005) are particularly useful since they perform simultaneous model selection and fitting. Penalties are typically imposed on the parameter norm, meaning parameter values are shrunken towards zero.
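
As a brief illustration of norm-based shrinkage (my example, not part of the post's analysis), a lasso fit with the glmnet package shrinks coefficients towards zero, with larger penalties setting more of them to exactly zero:

# Sketch of parameter-norm shrinkage with the lasso (glmnet).
library(glmnet)
bos <- MASS::Boston
X <- model.matrix(medv ~ ., bos)[, -1]
y <- bos$medv
fit <- glmnet(X, y, alpha = 1)  # alpha = 1 corresponds to the lasso
coef(fit, s = 0.1)              # small penalty: most coefficients non-zero
coef(fit, s = 1)                # larger penalty: many coefficients exactly zero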

In this recently published paper, I introduce the Hierarchical Feature Regression (HFR) which takes a somewhat different approach (Pfitzinger 2024). Instead of shrinking parameters towards zero, it shrinks them towards group target values. The idea is simple: if the effect of two features on the target is similar, their parameters should be similar. This type of grouping is highly intuitive and can lead to robust out-of-sample predictions.3

The HFR works by decomposing linear regression coefficients along a hierarchical graph and then shrinking the height of that graph. This is mathematically equivalent to a direct constraint on \(\nu\). The method has one hyperparameter, \(\kappa\), which can take values between 0 (which results in \(\nu = 1\)) and 1 (which results in \(\nu = k+1\)). Decomposing the coefficients along a hierarchical graph also conveys a wealth of information about the relationship among features and the signal factors driving the target.

Fig. 1 shows an example of a resulting graph for Boston housing data using 3 different levels of shrinkage:


Figure 1: Example of dendrograms from an HFR regression using Boston housing data and \(\kappa = 1\) (left), \(\kappa = 0.5\) (middle) and \(\kappa = 0.1\) (right).

Fig. 2 shows the result of applying the HFR alongside a panel of benchmark methods in a predictive workflow on the Boston housing data. Here I calculate out-of-sample RMSEs for 12 different regression techniques:

  1. OLS
  2. HFR
  3. 9 other linear regression methods from different conceptual paradigms (subset selection, penalized, etc.)
  4. One nonlinear method (random forest).

All linear methods predict house prices using available features and feature interactions. The nonlinear random forest predicts house prices using only the features since it forms its own — much more flexible — set of internal feature interactions.

The HFR performs exceptionally well, achieving the highest out-of-sample accuracy of all the linear methods, while the OLS regression is least accurate, illustrating the problem of overfitting. The 10-fold cross-validation used to determine optimal hyperparameters results in an average \(\kappa = 0.58\), which suggests that the HFR reduces \(\nu\) to just over half as many effective parameters as there are features in the data set.4 The good performance of the random forest suggests there is nonlinearity in the data that cannot be captured by the linear form of the remaining methods.


Figure 2: Out-of-sample prediction performance for 50 random training/test splits of the Boston housing data set. Data is randomly split into equally sized training and testing samples. A 10-fold cross validation is used on the training data to obtain optimal hyperparameters. The error bars represent the q1-q3 quartile RMSE ranges.

The above analysis can be reproduced using the tidyfit package (see here) with code provided in this gist.
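
For orientation, here is a minimal, self-contained sketch of a single 50/50 train/test comparison of OLS and the HFR (a simplification of the full workflow; it assumes that the hfr package provides a standard predict() method with a newdata argument, so check the package documentation before relying on it):

# Minimal sketch: one random train/test split comparing OLS and the HFR.
# The full analysis behind Fig. 2 repeats this over 50 splits and tunes
# the hyperparameters by 10-fold cross-validation on the training data.
library(hfr)
set.seed(42)
df <- MASS::Boston
X <- model.matrix(medv ~ .*., df)[, -1]   # features plus pairwise interactions
y <- df$medv
train <- sample(nrow(df), nrow(df) / 2)

ols_fit <- lm(medv ~ .*., data = df[train, ])
hfr_fit <- hfr(X[train, ], y[train], kappa = 0.58)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(y[-train], predict(ols_fit, newdata = df[-train, ]))
# Assumption: predict.hfr accepts new observations via 'newdata'
rmse(y[-train], predict(hfr_fit, newdata = X[-train, ]))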

Apart from accurate predictions, the HFR adds a valuable perspective to exploratory workflows by contributing to a better understanding of the effects structure in the data. To illustrate this point, I will walk through an example using the software package provided for R:

# install.packages("hfr")
library(hfr)

Next I load the Boston housing data used in the analysis above and create the model matrix which includes all variables and variable interactions:

data <- MASS::Boston
# model matrix with all features and pairwise interactions; [,-1] drops the intercept column
x <- model.matrix(medv ~ .*., data)[,-1]
y <- data$medv
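
A quick check of the dimensions: the Boston data has 506 observations and 13 features, so the interaction matrix has \(13 + \binom{13}{2} = 91\) columns, i.e. \(k + 1 = 92\) including the intercept.

dim(x)  # 506 x 91: 13 features plus 78 pairwise interactions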

The HFR model is estimated in the next code chunk, and a hierarchical graph is plotted which visualizes the structure captured in the model. Fig. 3 is a modified version of traditional dendrograms with two important distinctions:

  1. The height represents the effective model degrees of freedom. Thus an HFR model with \(\kappa = 1\) (no shrinkage) will produce a graph with a height of 92 (\(k+1\)), while the graph below has a height of 53.8, with several levels shrunken.

  2. The statistical significance in the level-specific regressions is visualized on the branch lines. Branches with \(p\)-values greater than 0.1 are shown as dashed lines.

# Fit an HFR model
mod <- hfr(x, y, kappa = 0.58)
# Generate the graph dendrogram
par(cex=0.5)
plot(mod, confidence_level = 0.9, max_leaf_size = 2)

Figure 3: Dendrogram of the HFR fit with \(\kappa = 0.58\). Dashed lines indicate statistically insignificant branches. The color bar on the right represents the incremental contribution of the new information added in each level of the hierarchy to overall goodness-of-fit (as measured by \(R^2\)). The size of the leaf nodes represents the coefficient values.

The HFR divides the data into broad clusters which initially contribute jointly to the predictions of the target. Each lower level refines the predictions and adds nuance to the coefficients. In this way signal is stacked — starting with more general patterns and adding incremental idiosyncratic information. By constraining the amount of idiosyncratic signal added by lower levels, shrinkage towards broader signal clusters is achieved.

Fig. 4 visualizes how different values for \(\kappa\) change the height of the graph and the resulting value of \(\nu\):


Figure 4: HFR dendrograms for different levels of \(\kappa\).
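
A comparison like Fig. 4 can be produced by refitting the model at a few values of \(\kappa\) (the values below are chosen for illustration and may differ from those used in the figure):

# Refit the HFR at different shrinkage levels and plot the
# resulting dendrograms side by side.
par(mfrow = c(1, 3), cex = 0.5)
for (kap in c(1, 0.5, 0.05)) {
  plot(hfr(x, y, kappa = kap), max_leaf_size = 2)
}
par(mfrow = c(1, 1))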

In the right panel of Fig. 4, the parameters are shrunken to reflect an effective information content equivalent to 4.5 features (plus intercept). This results in shrinkage towards target values for those parameters. This shrinkage can be illustrated by comparing the parameter values in those clusters to the parameters generated by alternative methods.

I begin by standardizing the data to ensure that parameters are comparable.5

scaled_df <- data.frame(medv = y, scale(x), check.names = FALSE)

Now I fit HFR, lasso and principal components regressions using the tidyfit::regress function. tidyfit is a package that makes comparisons between different regression techniques a lot easier by handling the training workflow and returning clean and comparable outputs:

library(tidyfit)
set.seed(123)
# Fit the models using 'regress'
# Note that the interactions do not need to be added to the formula
# since they are already contained in 'scaled_df'.
fit_df <- scaled_df |> 
  regress(medv ~ ., 
          HFR = m("hfr", kappa = 0.58), 
          Lasso = m("lasso"), `Principal Components` = m("pcr"),
          .cv = "vfold_cv", .cv_args = list(v = 10))

coef_df <- coef(fit_df)

Finally, I plot densities of the absolute coefficients in the 4 largest clusters in the dendrogram. The plot illustrates two points: (1) the HFR parameter estimates cluster more strongly than the estimates of other methods, with a higher density peak around the shrinkage target values; (2) the shrinkage target does not need to be close to zero, as shown by Cluster 4, illustrating that the HFR does not penalize the parameter norm.
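
The figure itself requires the cluster assignments from the dendrogram, but a simplified version can be sketched directly from the tidy coefficient frame (assuming coef_df contains the usual tidyfit columns model, term and estimate):

# Simplified sketch (not the original figure): densities of absolute
# coefficients by model, pooled across all terms rather than split by cluster.
library(dplyr)
library(ggplot2)
coef_df |>
  filter(term != "(Intercept)") |>
  ggplot(aes(x = abs(estimate), color = model)) +
  geom_density() +
  labs(x = "Absolute coefficient", y = "Density", color = "Model")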

In summary, the HFR provides tools both to explore data sets and to obtain good out-of-sample predictions that compare favorably to many alternative linear regression techniques. Examining the number and composition of signal clusters, the optimal shrinkage of the hierarchical graph, and the contribution of each resulting level to overall goodness-of-fit (represented by the color bar on the right of the dendrograms) can help us develop a better understanding of whether the data is driven by broad signal clusters or by individual contributions of each variable.

As a final note, these same concepts can also be generalized to optimization more broadly as illustrated in the application to financial portfolio optimization in Pfitzinger and Katzke (2023). In this context, dividing the data set into signal clusters has the added benefit of providing a good basis for signal diversification — a critical feature in financial portfolio construction.

References

Pfitzinger, Johann. 2024. “Cluster Regularization via a Hierarchical Feature Regression.” Econometrics and Statistics, in press. https://doi.org/10.1016/j.ecosta.2024.01.003.
Pfitzinger, Johann, and Nico Katzke. 2023. “Perspectives on Optimal Hierarchical Portfolio Selection.” SSRN Electronic Journal. https://doi.org/10.2139/ssrn.4409173.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B (Methodological) 58 (1): 267–88.
Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2): 301–20.

  1. Not to be confused with the residual degrees of freedom, which is \(n-\nu\), where \(n\) is the total number of observations.

  2. The effective degrees of freedom are difficult to quantify for a nonlinear regression and can often only be approximated using simulation techniques.

  3. Multiple alternative approaches exist that pursue a similar objective. The interested reader is referred to the paper for a discussion of alternative methods, as well as their advantages and disadvantages compared to the HFR.

  4. It is important to note that this is not feature selection. Like dimension reduction techniques, the HFR keeps all features but reduces the information contained in them to the equivalent of a lower-dimensional sub-space. Unlike standard dimension reduction techniques, this is achieved by imposing a hierarchical graph structure.

  5. It is important to note that most of the methods compared here and in Fig. 2 standardize the data internally (exceptions include, for instance, the Bayesian techniques or subset selection, which do not require standardized data). However, the parameters are always returned on the original scale. It is therefore necessary to standardize the data if we require standardized parameters.
