1. Introduction
A variable annuity (VA) is a tax-deferred retirement vehicle created by insurance companies to address concerns many people have about outliving their assets. VA policies typically contain guarantees, which include the guaranteed minimum death benefit (GMDB), guaranteed minimum accumulation benefit (GMAB), guaranteed minimum income benefit (GMIB), and guaranteed minimum withdrawal benefit (GMWB) (Hardy, 2003). These guarantees provide policyholders with downside protection during significant declines in the financial market. They are financial guarantees that cannot be adequately addressed by traditional actuarial approaches: during a bear market, when stock prices are falling, insurance companies can expect to lose large sums of money on their portfolios of VA policies.
Dynamic hedging is adopted by many insurance companies to mitigate the financial risks associated with VA guarantees. An important step of dynamic hedging is to quantify the risks, which involves calculating the fair market values of the guarantees. Since the guarantees are complex, their fair market values cannot be determined explicitly or in closed form. In practice, insurance companies resort to Monte Carlo simulation to calculate the fair market values of guarantees. While Monte Carlo simulation is flexible and can handle any type of guarantee, it is computationally intensive: valuing a large portfolio of VAs by Monte Carlo simulation can take days or weeks.
To speed up the valuation of VA portfolios based on Monte Carlo simulation, metamodelling techniques have been proposed in the past few years. See Gan & Lin (2015), Gan & Valdez (2017a), Hejazi et al. (2017), Gan (2018), Xu et al. (2018), Dang et al. (2019), Liu & Tan (2020), Gweon et al. (2020), Lin & Yang (2020), and Feng et al. (2020). Metamodelling techniques involve building a predictive model based on a small number of representative VA policies in order to reduce the number of policies that are valued by Monte Carlo simulation. Specifically, a metamodelling technique consists of the following four components:
1. Select a small number of representative VA policies;
2. Use Monte Carlo simulation to calculate the fair market values of the representative policies;
3. Build a predictive model, called a metamodel, based on the representative policies and their fair market values; and
4. Use the predictive model to estimate the fair market value for every VA policy in the portfolio.
Since only a small number of VA policies are valued by Monte Carlo simulation and the predictive model is much faster than Monte Carlo simulation, metamodelling techniques have the potential to reduce the valuation time significantly.
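To make the four steps concrete, the sketch below outlines the pipeline in R. It is a minimal illustration, not the implementation used in this paper: the data frame `portfolio`, the valuation routine `monte_carlo_fmv()`, and the use of plain k-means (rather than the more scalable clustering algorithms used in the literature) are all assumptions for illustration.

```r
library(rpart)

# `portfolio` (policy features) and monte_carlo_fmv() are hypothetical
X  <- scale(data.matrix(portfolio))             # features, coarsely numeric-coded
km <- kmeans(X, centers = 680, iter.max = 100)  # step 1: cluster the portfolio

# take the policy closest to each cluster centre as the representative
reps <- sapply(seq_len(nrow(km$centers)), function(k) {
  members <- which(km$cluster == k)
  d2 <- rowSums((X[members, , drop = FALSE] -
                 matrix(km$centers[k, ], nrow = length(members),
                        ncol = ncol(X), byrow = TRUE))^2)
  members[which.min(d2)]
})

fmv <- sapply(reps, function(i) monte_carlo_fmv(portfolio[i, ]))    # step 2

meta <- rpart(fmv ~ ., data = cbind(portfolio[reps, ], fmv = fmv))  # step 3

fmv_hat <- predict(meta, newdata = portfolio)   # step 4: value the whole portfolio
```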
In the past, ordinary kriging (Gan, 2013), universal kriging (Gan & Lin, 2017), the generalised beta of the second kind (GB2) regression model (Gan & Valdez, 2018), and neural networks (Hejazi & Jackson, 2016; Xu et al., 2018) have been used to build various types of metamodels. Kriging is a family of estimators used to interpolate spatial data. Advantages of kriging methods include producing accurate aggregate results at the portfolio level and requiring only a few parameters to estimate. However, kriging methods require a large number of distance calculations and assume a normal distribution for the response variable. The GB2 regression model has the advantage that it can handle highly skewed data, but estimating its parameters poses quite a challenge. Neural networks have several advantages: they can approximate any compactly supported continuous function arbitrarily well; they can model non-linear relationships between variables; and they can handle interactions between variables. From a metamodelling perspective, however, we intend to select only a small number of representative samples from the entire portfolio, whereas neural networks perform best when learning from a large training dataset; this makes them challenging for our purposes.
In this paper, we study the use of tree-based models to predict the fair market value of the guarantees embedded in a VA policy. Originating in the early 1960s, tree-based models are data mining algorithms that repeatedly partition the space of the explanatory variables to create a tree structure for predicting the response variable. Using survey data, Morgan & Sonquist (1963) developed the very first naive regression tree algorithm, called Automatic Interaction Detection (AID). Nowadays, tree-based models have become an attractive alternative predictive tool for building classification and regression models. Tree-based models possess the following properties:
• Tree-based models are non-parametric models, and thus do not require the specification of a functional form relating the explanatory variables to the response variable.
• Tree-based models can handle missing data automatically.
• Tree-based models can handle categorical variables naturally.
• Tree-based models can capture non-linear effects and handle interactions between variables.
• Tree-based models can perform variable selection and assess variable importance (Breiman et al., 1984; Ishwaran, 2007).
• Tree-based models, especially single tree models, can be interpreted straightforwardly by visualising the tree structure.
These advantages have encouraged us to explore and demonstrate the promising use of tree-based models as an alternative to various metamodels for the efficient valuation of large portfolios of VA products. This paper also provides details on practices that enhance the predictive performance of tree-based models, for example, hyperparameter tuning.
This paper is structured as follows. In Section 2, we discuss the concept of tree-based models and their extensions. Here, we also describe another framework of binary splitting generally called unbiased recursive partitioning; conditional inference trees and their extensions fall into this category. Section 3 provides details of hyperparameter tuning, an important process in building tree-based models that helps significantly improve prediction accuracy; the methods used in this paper include cross-validation and optimisation. In Section 4, we summarise the synthetic dataset used in our empirical investigation, perform preliminary data exploration, and compare the performance of various tree-based models, as well as other metamodels, in terms of predictive accuracy and computational efficiency. A menu of tree-based models is presented for completeness. Section 5 concludes the paper with some remarks.
2. Tree-based Models
In this section, we present several tree-based models. To that end, let the response variable be denoted by $\textbf{Y}$ with sample space $\mathcal{Y}$ (which can be multivariate), and let $n$ denote the number of observations. The $i$th observation of the $p$-dimensional explanatory variables is denoted by $\textbf{X}_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$, sampled from the space $\mathcal{X} = \mathcal{X}_{1} \times \ldots \times \mathcal{X}_{p}$. The observations form a learning sample denoted by $\mathcal{L}_n$:

$$\mathcal{L}_n = \left\{ (\textbf{Y}_i, x_{i1}, \ldots, x_{ip});\ i = 1, \ldots, n \right\}$$
2.1. Classification and Regression Tree (CART)
The Classification and Regression Tree (CART) algorithm (Breiman et al., 1984) uses a greedy search called recursive binary partitioning to create a tree structure. The algorithm includes growing and pruning steps. It can easily handle missing data by utilising “surrogate” splitting, which finds an alternative explanatory variable that mimics the explanatory variable containing missing values. The surrogate splitting process also provides variable importance at each node by measuring the decrease in the error function. Breiman et al. (1984) obtained conditions under which all recursive partitioning techniques are Bayes risk consistent. In conventional terms, the trees obtained by the algorithm are called classification trees when the response variable is categorical and regression trees when the response variable is continuous.
Here, we briefly formulate the process in the regression framework; for details, refer to Quan & Valdez (2018). The algorithm divides the explanatory variable space $\mathcal{X}$ into $M$ disjoint regions, $R_{1}, R_{2}, \ldots, R_{M}$, and assigns a constant $c_{m}$ as the predicted value for region $R_m$:

$$T(\textbf{X};\ \Theta) = \sum_{m=1}^{M} c_{m}\, I(\textbf{X} \in R_{m})$$

where $\Theta = \{R_{m},c_{m} \}_{m=1}^{M}$.
Under the sum of squared errors (SSE) loss function, the optimal $\hat c_m$ is the average of the $y_i$ in the region $R_{m}$. The algorithm then minimises the following penalised objective function $L(T)$:

$$L(T) = \sum_{m=1}^{M} \sum_{\textbf{X}_i \in R_m} \left(y_i - \hat c_m\right)^2 + \alpha M$$

The tuning parameter $\alpha \geq 0$ governs the trade-off between the size of the tree and its goodness of fit to the data. Estimation of $\alpha$ can be achieved by cross-validation, which is explained in a later section.
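As an illustration, a regression tree with cost-complexity pruning can be fitted with the rpart package, whose complexity parameter `cp` plays the role of a normalised $\alpha$; the data frame `train` and response `fmv` are hypothetical names.

```r
library(rpart)

# grow a deliberately large tree; cp = 0 disables the complexity penalty
fit <- rpart(fmv ~ ., data = train, method = "anova",
             control = rpart.control(minsplit = 5, cp = 0))

# rpart stores the cross-validated error for each cp value in fit$cptable;
# prune back at the cp value with the smallest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```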
2.2. Ensemble methods
Ensemble methods combine several models to help improve prediction accuracy. Two popular ensemble methods for regression trees are (a) bagging and random forests, and (b) gradient boosting.
2.2.1. Bagging and random forests
Bagging (Breiman, 1996) uses an ensemble of sufficiently deep CART trees, $\{T(\textbf{X};\ \Theta_b),\ b= 1,2,\ldots,B\}$, without pruning. These trees are built on $B$ bootstrap samples of the training dataset, and the prediction is the average of the $B$ regression trees:

$$\hat{f}_{\text{bag}}(\textbf{X}) = \frac{1}{B} \sum_{b=1}^{B} T(\textbf{X};\ \Theta_b)$$
Random forests (Breiman, 2001) reduce the correlation between the CART trees by selecting the best split from a random subset of explanatory variables at each node of a tree. In effect, the average prediction of multiple regression trees is expected to have lower variance than a single individual regression tree. Larger random subsets of explanatory variables can improve the predictive capability of individual trees, but they can also increase the correlation between trees and negate any gains from averaging multiple predictions. The bootstrap resampling of the data for training each tree also increases the variation between the trees. The accuracy of random forests depends on the strength of each individual tree and a measure of the dependence between them.
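A sketch using the randomForest package, with hypothetical data `train`: setting `mtry` to the full number of predictors reduces random forests to bagging, while smaller values decorrelate the trees.

```r
library(randomForest)

p <- ncol(train) - 1   # number of explanatory variables

# bagging: every predictor is a split candidate at every node
bag <- randomForest(fmv ~ ., data = train, ntree = 200, mtry = p, nodesize = 5)

# random forests: only a random subset of predictors is tried at each node
rf  <- randomForest(fmv ~ ., data = train, ntree = 200, mtry = floor(p / 3),
                    nodesize = 5)

plot(rf)   # out-of-bag error as a function of the number of trees
```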
2.2.2. Gradient boosting
Gradient boosting (Freund & Schapire, 1997), sometimes called gradient boosted regression trees, grows trees sequentially by putting more weight on the residuals from previous trees. The boosted tree is a sum of such trees:

$$f_B(\textbf{X}) = \sum_{b=1}^{B} T_{b}(\textbf{X};\ \Theta_b)$$

where the $T_{b}(\textbf{X};\ \Theta_b)$ are regression trees and $B$ is the number of iterations, or the number of additive trees. In each step $b$, for $b=1,\ldots,B$, we find the regression tree parameters $\Theta_b$ by solving the following optimisation problem:

$$\hat{\Theta}_b = \underset{\Theta_b}{\operatorname{arg\,min}} \sum_{i=1}^{n} L\left(y_i,\ f_{b-1}(\textbf{X}_i) + T_b(\textbf{X}_i;\ \Theta_b)\right)$$

where $\Theta_b=\{R_{mb},c_{mb} \}_{m=1}^{M}$. The loss function $L$ can be the squared error loss function. For other differentiable loss functions, the above optimal solution can be obtained using numerical optimisation via gradient boosting; see Friedman (2001).
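A sketch of fitting such a model with the gbm package, where the number of boosting iterations $B$ is selected by cross-validation; the data frame `train`, response `fmv`, and test set `test` are hypothetical names.

```r
library(gbm)

gb <- gbm(fmv ~ ., data = train, distribution = "gaussian",
          n.trees = 1000,           # B, the number of additive trees
          interaction.depth = 5,    # maximum depth of each tree
          shrinkage = 0.1,          # learning rate
          n.minobsinnode = 5,
          cv.folds = 10)

best_B <- gbm.perf(gb, method = "cv")            # iteration minimising CV error
pred   <- predict(gb, newdata = test, n.trees = best_B)
```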
An alternative boosting algorithm, which is not as attractive for our empirical dataset, is delta boosting. However, for generalised linear models, Lee & Lin (2018) demonstrated that delta boosting outperforms the gradient boosting algorithm using claims data on collision coverage for vehicle insurance from a Canadian insurer.
2.3. Unbiased recursive partitioning
The CART algorithms described above employ greedy recursive binary partitioning, which has some drawbacks. One drawback is overfitting, which can be resolved by a pruning process using cross-validation. Another drawback is the resulting bias in variable selection, especially when the explanatory variables present many possible splits or missing values; this latter drawback is harder to remedy. Strobl et al. (2009) illustrated the bias in variable selection towards variables with many categories or numerical values or, even more unintuitively, many missing values: such explanatory variables are artificially preferred by recursive binary partitioning algorithms. See Strobl et al. (2007). This evidence prompts us to investigate variable selection and its interpretation carefully in VA valuation, especially when many benefit riders are present.
2.3.1. Conditional inference trees
As a remedy for the bias drawback, Hothorn et al. (2006) introduced a conditional inference framework for unbiased recursive partitioning, with stopping criteria based on the statistical permutation test of Strasser & Weber (1999). The unbiased recursive partitioning framework can be applied to univariate continuous or discrete regression, censored regression, classification, ordinal regression, and multivariate regression. In this paper, we focus on univariate continuous regression. The conditional inference trees algorithm has been shown to produce predictive accuracy comparable to that of CART.
Guelman et al. (2014) applied conditional inference trees to personalised cross-sell marketing of insurance products, building personalised treatment learning models to select profitable policyholders. In particular, the objective of the marketing campaign was to determine the group of customers with existing auto insurance policies that should be optimally targeted with an offer of an additional home insurance policy.
The main differences between the CART and conditional inference trees algorithms are the variable selection at each split and the stopping criterion. To explain the details, we need to define a few terms. The conditional distribution $F(\textbf{Y}|\textbf{X})$ underlies the statistics that measure the association between the response variable and the explanatory variables.
In real-life data, some explanatory variable values $x_{ij}$ may be missing. The tree-based model ultimately finds a membership for each observation and assigns it to a terminal node, $R_{m}$. To define this membership, we introduce, for each terminal node $R_{m}$, a vector of case weights denoted by $\textbf{w}=(w_1, \ldots, w_n)$. The weight $w_i$ can be any positive value if observation $i$ belongs to the specific terminal node; without loss of generality, we restrict the weights to either one or zero. Define the symmetric group, $S(\mathcal{L}_n, \textbf{w})$, as all possible permutations of the observations with weight one. For example, if $w_1 = 1, w_2 = 0, w_3 = 1, w_4 = w_5 = \ldots = w_n = 0$, then $S(\mathcal{L}_n, \textbf{w}) = \{ ( \textbf{Y}_1, x_{11}, x_{12}, \ldots, x_{1p})$, $(\textbf{Y}_3, x_{31}, x_{32}, \ldots, x_{3p})$, $( \textbf{Y}_3, x_{11}, x_{12}, \ldots, x_{1p})$, $(\textbf{Y}_1, x_{31}, x_{32}, \ldots, x_{3p}) \}$.
We now briefly describe unbiased recursive binary splitting. First, under a specific vector of case weights, apply a statistical hypothesis test to determine whether there is any dependency between the response variable and the explanatory variables. If there is a dependency, find the most significant explanatory variable (the one with the strongest association), perform the split, and update the case weights. If there is no dependency, stop the process.
In formulating the process, the null hypothesis is that all the explanatory variables are independent of the response variable:

$$H_0\,{:}\, \bigcap_{j=1}^{p} H^j_0$$

where $H^j_0\,{:}\, F(\textbf{Y}|\textbf{X}_{\cdot j})=F(\textbf{Y})$. We then use the conditional distribution of linear statistics in the permutation test of Strasser & Weber (1999). The association between the response and the $j$th explanatory variable is measured by the linear statistic

$$T_{X_j}(\mathcal{L}_n, \textbf{w}) = \mathrm{vec}\left(\sum_{i=1}^{n} w_i\, g_j(x_{ij})\, h(\textbf{Y}_i)^{\top}\right) \in \mathbb{R}^{p_jq}$$

where $g_j\,{:}\,\mathcal{X}_j \to \mathbb{R}^{p_j}$ is a transformation of the $j$th explanatory variable and $h\,{:}\,\mathcal{Y} \to \mathbb{R}^{q}$ is an influence function of the response. Under $H_0$ and given all permutations of the response variable, the conditional expectation $\mu_j \in \mathbb{R}^{p_jq}$ can be derived as follows (Hothorn et al., 2006):

$$\mu_j = \mathbb{E}\left(T_{X_j}(\mathcal{L}_n, \textbf{w}) \,\middle|\, S(\mathcal{L}_n, \textbf{w})\right) = \mathrm{vec}\left(\left(\sum_{i=1}^{n} w_i\, g_j(x_{ij})\right) \mathbb{E}\left(h \mid S(\mathcal{L}_n, \textbf{w})\right)^{\top}\right) \tag{1}$$

where

$$\mathbb{E}\left(h \mid S(\mathcal{L}_n, \textbf{w})\right) = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i\, h(\textbf{Y}_i)$$

and the conditional covariance $\Sigma_j$ is derived analogously (Hothorn et al., 2006).
The linear statistic, $T_{X_j}(\mathcal{L}_n, \textbf{w})$, can be standardised, since $T_{X_j}(\mathcal{L}_n, \textbf{w})-\mu_j$ is asymptotically normal under $H_0$, conditioned on the symmetric $\sigma$-fields $S(\mathcal{L}_n, \textbf{w})$; see Strasser & Weber (1999). There are a few ways of standardising (Hothorn et al., 2006). One is based on the maximum of the absolute values of the standardised linear statistics:

$$z_{\max}\left(T_{X_j}, \mu_j, \Sigma_j\right) = \max_{k=1,\ldots,p_jq} \left| \frac{\left(T_{X_j} - \mu_j\right)_k}{\sqrt{\left(\Sigma_j\right)_{kk}}} \right|$$

Another is based on the quadratic form:

$$z_{\text{quad}}\left(T_{X_j}, \mu_j, \Sigma_j\right) = \left(T_{X_j} - \mu_j\right)^{\top} \Sigma_j^{+} \left(T_{X_j} - \mu_j\right)$$

where $\Sigma_j^+$ is the Moore–Penrose inverse of $\Sigma_j$, which may be computationally intensive. The Moore–Penrose inverse is the most common type of pseudoinverse; see Moore (1920) and Penrose (1955). For the quadratic form, the test statistic asymptotically follows a chi-squared distribution with $\operatorname{rank}(\Sigma_j)$ degrees of freedom.
Since the test statistics $z(T_{X_j}, \mu_j, \Sigma_j)$ cannot be compared directly across explanatory variables, owing to possible differences of scale, we use the p-value to choose the most significant explanatory variable. We select the explanatory variable, $X_{j^*}$, with the smallest p-value, $j^* = \underset{j=1,\ldots,p}{\operatorname{arg\,min}} \ P_j$, where

$$P_j = \mathbb{P}_{H^j_0}\left( z\left(T_{X_j}, \mu_j, \Sigma_j\right) \geq z\left(t_{X_j}, \mu_j, \Sigma_j\right) \,\middle|\, S(\mathcal{L}_n, \textbf{w}) \right)$$

and $t_{X_{j}} \in \mathbb{R}^{p_jq}$ is the observed test statistic from the dataset.
After picking the best explanatory variable $X_{j^*}$, we then find the best splitting point according to the splitting criterion. This criterion can be a simple binary split, as in CART, or a multiway split, as in O’Brien (2004). The permutation test framework can again be used to find the best split point.
Given a predefined significance level, $\alpha$, if the null hypothesis, $H_0$, cannot be rejected, then the binary splitting process stops. Here, the p-value for the null hypothesis can be calculated using the Bonferroni-adjusted p-value, $1-(1-P_j)^p$, or min-p-value resampling. For more advanced multiple testing, we refer the reader to Westfall & Young (1993).
The level of significance, $\alpha$, controls the tree size. It can be tuned using cross-validation, in a process similar to tree pruning in CART: one would initially set a higher significance level, $\alpha$, which leads to a larger tree, and then prune the terminal nodes by setting a lower significance level, $\alpha^* \leq \alpha$. The significance level can also be predetermined according to a tolerance based on actuarial expertise. Moreover, it can be interpreted in the traditional statistical sense of balancing the Type I and Type II errors.
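This procedure is implemented in the partykit package. A sketch with hypothetical data `train`, using the quadratic test statistic and Bonferroni adjustment described above:

```r
library(partykit)

ct <- ctree(fmv ~ ., data = train,
            control = ctree_control(teststat = "quad",       # quadratic form statistic
                                    testtype = "Bonferroni",  # adjusted p-values
                                    alpha = 0.05,             # significance level (tunable)
                                    minbucket = 5))

plot(ct)   # terminal nodes are displayed with box plots of the response
```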
2.3.2. Conditional random forests
Random forests variable importance provides essential interpretation for ensemble tree-based models, although it is not reliable to perform variable selection according to variable importance scores. This is especially true when the explanatory variables vary in their scale of measurement or number of categories; see Strobl et al. (2007). On the other hand, conditional random forests are built from conditional inference trees, which can provide unbiased variable selection.
The framework of conditional random forests is very similar to that of random forests. The conditional inference trees are built on bootstrap samples or subsamples of the original dataset, and a random subset of the explanatory variables is considered at each split. However, there are a few differences between conditional random forests and random forests. First, after building all the conditional inference trees, the final ensemble averages the observation weights extracted from each tree; this is a different aggregation scheme from random forests, which simply average the predictions. Second, conditional random forests can also model censored, multivariate, and ordered response variables. Third and finally, when explanatory variables vary in their scale of measurement and number of categories, which is very typical of a dataset, conditional random forests can provide unbiased variable selection and variable importance, based on conditional inference trees that use subsamples without replacement.
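A conditional random forest can be fitted with partykit::cforest; subsampling without replacement (the default `perturb` setting) supports the unbiased variable importance mentioned above. Data names and the `mtry` value are illustrative assumptions.

```r
library(partykit)

crf <- cforest(fmv ~ ., data = train, ntree = 200,
               perturb = list(replace = FALSE, fraction = 0.632),  # subsampling
               mtry = 4)   # size of the random subset of predictors per split

pred <- predict(crf, newdata = test)       # aggregated over observation weights
vi   <- varimp(crf, conditional = TRUE)    # conditional permutation importance
```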
3. Parameter Tuning of Tree-based Models
As noted in the previous section, hyperparameter tuning (optimisation) of tree-based models usually involves cross-validation to perform model selection. Table 1 lists some common tuning hyperparameters for tree-based models. The unbiased recursive binary splitting framework is implemented in both the party and partykit R packages.
3.1. Cross-validation
Cross-validation was introduced to correct the overoptimistic prediction accuracy obtained by evaluating a model on the same dataset on which it was trained; in other words, it is an attempt to avoid overfitting. See Mosteller & Tukey (1968), Stone (1974), and Geisser (1975).
Cross-validation is a way to examine tree-based model performance on a hypothetical validation dataset when an exact validation set is not available. In detail, the dataset is separated into two disjoint parts: a training dataset and a validation dataset. The model is built on the training dataset and its prediction performance is examined on the validation dataset.
Table 1. R packages for tree-based models with their tuning hyperparameters
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220329105624827-0132:S1748499521000075:S1748499521000075_tab1.png?pub-status=live)
Various splitting strategies lead to different cross-validation techniques. Data splitting requires certain assumptions: for example, the data must be identically distributed, and the training and validation datasets must be independent. If these assumptions are not satisfied, then some modifications to cross-validation are needed; see Opsomer et al. (2001) and Leung (2005). Insurance datasets usually exhibit time dependency and outliers, which require extra attention when we apply general cross-validation to perform model selection.
In the holdout method (Devroye & Wagner, 1979), the dataset is separated only once, and typically the validation dataset is smaller than the training dataset. The holdout method can be considered the simplest form of cross-validation. It suffers from the drawback that the results depend highly on the data split: the data in the validation dataset may carry important information that is left out when we train the model. In other words, the data segmentation easily leads to bias in the results. To deal with this issue, other cross-validation methods are discussed below.
Unlike the holdout method, other cross-validation methods generally split the dataset several times and average the prediction accuracy over the validation datasets. The question is how to split the dataset. There are two types of splitting: exhaustive data splitting and partial data splitting.
3.1.1. Exhaustive cross-validation
Exhaustive cross-validation methods train and test on all possible ways to divide the original dataset into a training dataset and a validation dataset. See Stone (1974), Allen (1974), Geisser (1974), and Shao (1993). There are two popular exhaustive cross-validation methods:
• Leave-one-out. Each data point in turn is withdrawn from the original dataset and used as validation data. Obviously, this needs $n$ runs. Leave-one-out cross-validation provides nearly unbiased estimation but suffers from high variance.
• Leave-p-out. Here $p$ data points are withdrawn from the original dataset, which takes $\binom{n}{p}$ runs. In practice, for a large dataset, exhaustive cross-validation is computationally intensive.
3.1.2. Non-exhaustive cross-validation
Non-exhaustive cross-validation methods apply partial data splitting. One well-known example is k-fold cross-validation (Geisser, 1975). In k-fold cross-validation, the original dataset is segmented into $k$ equal-sized subsamples. In each run, one of the $k$ subsamples is held out as the validation dataset and the model is built on the remaining $k-1$ subsamples. In total, there are $k$ runs, with each subsample selected exactly once as the validation dataset. This process ensures that every data point appears in both the training dataset and the validation dataset. The method takes the average of the prediction results on the $k$ validation datasets.
When $k$ equals the number of observations in the dataset, k-fold cross-validation becomes leave-one-out cross-validation. Choosing the best $k$ is an open question. Once $k$ is selected, the size of the training set is fixed, as is the number of splits. The bias of k-fold cross-validation decreases with $k$, since a larger $k$ leads to a larger training set. On the other hand, its variance increases with $k$, since a larger $k$ leads to a larger number of validation procedures (i.e. more runs). In practice, $k$ is often chosen to be between 5 and 10. Ten-fold cross-validation has been shown to provide good model selection and point estimation; see Kohavi (1995).
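A minimal sketch of ten-fold cross-validation for a regression tree, with hypothetical data `train` and response `fmv`:

```r
library(rpart)

k <- 10
folds <- sample(rep(1:k, length.out = nrow(train)))   # random fold assignment

cv_mse <- sapply(1:k, function(i) {
  fit  <- rpart(fmv ~ ., data = train[folds != i, ])    # fit on k-1 folds
  pred <- predict(fit, newdata = train[folds == i, ])   # validate on held-out fold
  mean((train$fmv[folds == i] - pred)^2)
})

mean(cv_mse)   # cross-validated estimate of the prediction error
```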
Additional cross-validation techniques and statistical properties of cross-validation are discussed in Arlot & Celisse (2010) and Leung (2005).
3.2. Hyperparameter optimisation
3.2.1. Grid search
Grid search is the most traditional way to optimise hyperparameters. It is an exhaustive search over a manually specified subset of the hyperparameter space, with the best combination of hyperparameters usually selected by cross-validation. Determining the subset involves setting bounds and a discretisation, which requires expertise with the model. Grid search is computationally intensive due to the explosive number of combinations of hyperparameters: if a model has $p$ hyperparameters and each has $c$ choices, then the subset contains $c^p$ combinations. Fortunately, grid search is essentially parallel, since evaluating each combination of the hyperparameters is independent of the others. This makes grid search feasible given enough computational power.
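For example, the caret package performs a grid search over a user-specified grid and scores each combination by cross-validation. In caret's built-in "rpart" method only `cp` is tunable, so this sketch (with hypothetical data `train`) varies `cp` alone and fixes `minsplit` through the control object:

```r
library(caret)
library(rpart)

grid <- expand.grid(cp = seq(0.001, 0.1, by = 0.001))   # candidate cp values
ctrl <- trainControl(method = "cv", number = 10)        # ten-fold cross-validation

tuned <- train(fmv ~ ., data = train, method = "rpart",
               trControl = ctrl, tuneGrid = grid,
               control = rpart.control(minsplit = 5))   # fixed, not tuned by caret

tuned$bestTune   # cp value with the smallest cross-validated RMSE
```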
3.2.2. Random search
Random search randomly samples the manually specified grid of hyperparameters and has performance similar to grid search. It can outperform grid search under the same computational constraint and time, especially when only a small number of hyperparameters affect the final performance of the model (Bergstra & Bengio, 2012). In other words, if the close-to-optimal region of hyperparameters occupies at least 5% of the search space, then a random search with a certain number of trials (typically 40–60) will find that region with high probability. Like grid search, it can be parallelised. Compared to grid search, random search requires fewer trials and allows prior knowledge of how to sample to be incorporated.
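In caret, random search only requires changing the search strategy; `tuneLength` then gives the number of random hyperparameter draws. A sketch with hypothetical data `train`:

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 10, search = "random")

# 50 random draws from gbm's hyperparameter space, scored by ten-fold CV RMSE
tuned <- train(fmv ~ ., data = train, method = "gbm",
               trControl = ctrl, tuneLength = 50, verbose = FALSE)

tuned$bestTune
```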
3.2.3. Bayesian optimisation
Bayesian hyperparameter tuning builds up knowledge about the association between hyperparameter settings and model performance to improve the selection of the next hyperparameter setting. Unlike the previously discussed tuning methods, the selection of the next hyperparameter setting is no longer independent of the prior selections; in other words, it is sequential, and hence cannot easily be parallelised. The hyperparameter tuning becomes an optimisation problem. There are several sequential global optimisation methods for finding the hyperparameter setting that maximises the model's generalisation performance. One of the most popular techniques is Bayesian optimisation, which models generalisation performance as a sample from a Gaussian process (GP) (Snoek et al., 2012) and creates a regression model to formalise the relationship between the model performance and the model hyperparameters.
Specifically, let the function $f\,{:}\, \mathcal{H} \to \mathbb{R}$ map a hyperparameter setting $h \in \mathcal{H}$ to the prediction accuracy of the resulting model, so that tuning amounts to solving

$$h^{*} = \underset{h \in \mathcal{H}}{\operatorname{arg\,max}}\ f(h)$$

within a domain $\mathcal{H} \subset \mathbb{R}^d$, which is a bounding box, where $d$ is the number of tuning hyperparameters. The function $f$ is a realisation of a GP with mean $\mu$ and covariance kernel $K$, i.e. $f \sim \mathcal{GP}(\mu, K)$. Bayesian optimisation assumes the prediction follows a normal distribution; after choosing the kernel function $K$, we can compute the mean and variance of this normal distribution. For further details, see Snoek et al. (2012) and Martinez-Cantin (2014).
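A sketch with the rBayesianOptimization package, tuning two gbm hyperparameters; the objective returns the negated ten-fold cross-validation RMSE so that maximising the score minimises the error. The bounds and data names are illustrative assumptions.

```r
library(caret)
library(rBayesianOptimization)

# objective: ten-fold CV RMSE of a gbm as a function of two hyperparameters
cv_score <- function(interaction.depth, shrinkage) {
  fit <- train(fmv ~ ., data = train, method = "gbm",
               trControl = trainControl(method = "cv", number = 10),
               tuneGrid = data.frame(n.trees = 1000,
                                     interaction.depth = interaction.depth,
                                     shrinkage = shrinkage,
                                     n.minobsinnode = 5),
               verbose = FALSE)
  list(Score = -min(fit$results$RMSE), Pred = 0)   # maximise negative RMSE
}

opt <- BayesianOptimization(cv_score,
                            bounds = list(interaction.depth = c(1L, 10L),
                                          shrinkage = c(0.01, 0.3)),
                            init_points = 10,   # initial GP design points
                            n_iter = 50)        # sequential GP-guided searches
opt$Best_Par
```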
There are also a few other automatic hyperparameter optimisation techniques: gradient-based methods, evolutionary methods, and those based on the tree-structured Parzen estimator. Gradient-based optimisation computes the gradient of the performance with respect to the hyperparameters and then optimises them using gradient descent. See Bergstra et al. (2011) and Larsen et al. (1996).
4. Applications to VA Valuation
In this section, we demonstrate the application of tree-based models to address a computational problem arising from the valuation of variable annuity products. We use the hierarchical k-means algorithm (Nister & Stewenius, 2006) to select the representative VA policies.
4.1. Data description
The dataset is a synthetic dataset that contains 190,000 VA policies; for details, see Gan & Valdez (2017b). We select 680 VA contracts from the dataset as the training dataset and 340 VA contracts as the validation dataset, and perform prediction on all 190,000 VA contracts. We use a validation dataset to speed up the hyperparameter tuning process and model selection. Some summary statistics of the training dataset are provided in Table 2.
Table 2. Summary statistics of the variables for the training dataset. Dollar amounts are in thousands (’000s)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220329105624827-0132:S1748499521000075:S1748499521000075_tab2.png?pub-status=live)
4.2. Hyperparameter optimisation results
As shown in Table 6, the tree-based models have lower computational time. However, everything comes with a price: building tree-based models requires hyperparameter tuning, which involves more techniques than traditional statistical methods. As mentioned in Section 3.2, hyperparameter optimisation can be achieved with three schemes. In this section, we provide the grid search results, which give the best prediction results amongst the three schemes. For results produced by the other two, more efficient, optimisation schemes, readers are referred to Appendix A.
For all model calibration, we train the models on the training dataset (680 VA contracts). Since this dataset is not large, it allows us to perform grid search with ten-fold cross-validation. All the hyperparameters mentioned below result from the grid search method.
For the regression tree, we grow a full tree using recursive binary splitting with a minimum of five observations in a region for a split to be attempted. We then prune this fully grown tree using cost-complexity pruning with a "one standard deviation rule" regularisation parameter of 1.084e-02.
In detail, as mentioned in Table 1, we tune two common hyperparameters, "cp" and "minsplit", for the regression tree based on the training dataset using ten-fold cross-validation. As described in Section 3.2.1, the manually specified subset of the hyperparameter space is the combination of "cp", ranging from 0.001 to 1 in increments of 0.001, and "minsplit", ranging from 2 to 10 in increments of 1. This subset has 9,000 combinations. With ten-fold cross-validation, the computation is heavy, but it is feasible since our training dataset is small. We choose the optimal hyperparameter setting based on the minimum cross-validation error. The grid search results are shown in Table 3. We then compare, on the validation dataset, the prediction accuracy of the pruned tree with the minimum cross-validation error against that of the pruned tree obtained with the "one standard deviation rule", and thus arrive at the optimal hyperparameter setting mentioned above. For the rest of the tree-based models, we do not report the detailed grid search process.
Figure 1 shows the regression tree plot. The first split is on the product type, and the following splits use the guaranteed benefit amount; the product type is the most important variable for the regression tree. At each split, observations meeting the split condition are assigned to the left node, and the rest to the right. Each node shows the number of observations and the prediction; the darker the green colour, the larger the predicted value. We can also observe that later nodes depend on the previous splits, which is how a tree-based model detects interactions automatically.
Table 3. Grid search results
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220329105624827-0132:S1748499521000075:S1748499521000075_tab3.png?pub-status=live)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220329105624827-0132:S1748499521000075:S1748499521000075_fig1.png?pub-status=live)
Figure 1. A regression tree. Dollar amounts are in thousands (’000s).
Figure 2 shows the out-of-bag (OOB) error as a function of the number of trees in bagged trees. We can see that the OOB error stabilises after approximately 150 trees; in our experience, it stabilises after 200–300 trees for most datasets. The bagged trees model grows individual trees on 200 bootstrap samples, with a minimum of five observations at the terminal nodes.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220329105624827-0132:S1748499521000075:S1748499521000075_fig2.png?pub-status=live)
Figure 2. Determining the optimal number of trees in bagged trees.
Figure 3 shows that the test error and the OOB error are minimised at 16, suggesting the use of all the explanatory variables at each split. In other words, the optimal random forests model is the same as bagging.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220329105624827-0132:S1748499521000075:S1748499521000075_fig3.png?pub-status=live)
Figure 3. Random Forests tuning with the number of random subsets ($m_{try}$).
In this case, random forests are equivalent to bagged trees. Such a result may appear quite surprising. One explanation is that the dataset is a synthetic dataset in which all the explanatory variables were used to create the response variable, as described in Gan & Valdez (2017b); each tree then requires the full set of explanatory variables to achieve higher prediction accuracy. Another explanation is that omitting important explanatory variables, such as the product type and guaranteed benefit amount, at a split may decrease the prediction accuracy of random forests. However, as we can see in Figure 3, the marginal decrease in error beyond a random subset of size 8 is minimal.
The final tuned gradient boosted regression trees model has 1,000 iterations, a maximum depth of variable interactions of 5, a minimum of 5 observations in the terminal nodes, and a learning rate of 0.1.
The conditional inference trees model applies the quadratic test statistic to perform variable selection and choose the split point, applies Bonferroni adjustments to compute the distribution of the test statistic, and sets 0.04 as the significance level for variable selection, with a minimum of five observations in the terminal nodes.
Figure 4 shows a simplified conditional inference tree plot. In this figure, the conditional inference tree is held at a maximum depth of three for convenience of visualisation; the final model has more branches and terminal nodes, but its structure is identical to this simplified tree for the first few splits. Like the regression tree, the conditional inference tree uses the product type at the first split. Each node shows the selected explanatory variable and its significance level. The tree has a similar structure to the regression tree, but it provides box plots at the terminal nodes, thereby allowing the decision maker to pick the desired predicted value; for example, we can use other quantiles to assign the predicted value at a terminal node.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220329105624827-0132:S1748499521000075:S1748499521000075_fig4.png?pub-status=live)
Figure 4. A conditional inference tree.
The conditional random forest aggregates several hundred conditional inference trees with a minimum of five observations at the terminal nodes. These trees also select the full set of explanatory variables at each split, which is consistent with the previous situation; in other words, the conditional random forest becomes conditional inference bagged trees.
In summary, we listed the final hyperparameter settings tuned by grid search and constructed the final models based on these hyperparameters. We discuss these final models further in the following sections.
4.3. Model performance and computational expense
We apply several validation measures to compare the performance of the different tree-based models. In addition, we compare tree-based models with ordinary kriging and GB2 regression from previous studies; see Gan (2013) and Gan & Valdez (2018). All validation measures of prediction accuracy used in this comparison are defined in Table 4. Table 5 provides the numerical details of the prediction accuracy of these methods on the 190,000 VA contracts. Tree-based models are traditionally tuned by minimising the MSE, which may have resulted in less desirable values of ME and PE. However, when evaluating model performance, particularly in a business setting, it is best to examine and compare several validation measures. Although the values of ME and PE are negative for all tree-based models, their absolute values are small and comparable to those of the other models. In addition, gradient boosting performs extremely well in terms of all the other validation measures.
Table 4. Validation measures
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220329105624827-0132:S1748499521000075:S1748499521000075_tab4.png?pub-status=live)
Table 5. Prediction accuracy of different models
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220329105624827-0132:S1748499521000075:S1748499521000075_tab5.png?pub-status=live)
To visually compare model performance, we introduce a heatmap. Figure 5 provides a heatmap comparing the performance of the models according to the validation measures. For ease of comparison, all the measures are rescaled so that, for each measure, a value of 100 is the best and a value of 0 is the worst. For the Gini index, $R^2$, and the CCC, higher values are better, so for these measures we find the highest value in each column and scale it to 100; the smallest value is rescaled to 0, and values in between are rescaled according to their distance from the two boundaries. For all the other measures, smaller absolute values are better, so we take the reciprocal of the original absolute value and then apply the same rescaling. We also colour-code the figure, with dark red representing the worst and dark blue the best; the performance of any model in between is measured by its degree of closeness to each of these colours. A similar heatmap for model comparison appears in Quan & Valdez (2018).
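The rescaling can be written compactly; a sketch, where `x` is a column of a validation measure across models:

```r
# rescale a vector of validation measures to [0, 100];
# for "smaller absolute value is better" measures, invert first
rescale <- function(x, higher_is_better = TRUE) {
  v <- if (higher_is_better) x else 1 / abs(x)
  100 * (v - min(v)) / (max(v) - min(v))
}
```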
Table 6. Computational expenses
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220329105624827-0132:S1748499521000075:S1748499521000075_tab6.png?pub-status=live)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220329105624827-0132:S1748499521000075:S1748499521000075_fig5.png?pub-status=live)
Figure 5. A heatmap of model performance according to the various validation measures.
We find that gradient boosting has the best overall performance. It is worth mentioning that conditional inference trees have better prediction performance under the mean error and percentage error than the other tree-based models; at the portfolio level, the conditional inference trees produce good results. Under the two validation measures ME and PE, the GB2 method outperforms the other methods.
Table 6 provides details of the computational expenses, which include training time and prediction time. We find that the tree-based models based on the CART algorithm are more computationally efficient than those based on the conditional inference framework. The extra computational cost of conditional inference trees can likely be attributed to the additional step of performing a hypothesis test at each split. Moreover, tree-based models outperform the other, previously proposed methods.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220329105624827-0132:S1748499521000075:S1748499521000075_fig6.png?pub-status=live)
Figure 6. Variable importance for tree-based models (all the explanatory variables).
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220329105624827-0132:S1748499521000075:S1748499521000075_fig7.png?pub-status=live)
Figure 7. Variable importance for tree-based models (categorical explanatory variables).
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220329105624827-0132:S1748499521000075:S1748499521000075_fig8.png?pub-status=live)
Figure 8. Lift curve plots.
4.4. Variable importance
Figure 6 shows the variable importance for all five tree-based models. The first three models are based on regression trees generated by the traditional CART algorithm; the last two are based on unbiased conditional inference trees. For random forests (bagged trees), there are two types of variable importance: one based on %IncMSE and another based on IncNodePurity. %IncMSE is based on the difference in prediction error after randomly permuting the values of an explanatory variable on the test set. IncNodePurity is calculated by accumulating the improvement in the loss at each split, inherited from the single trees. There are also two ways to calculate variable importance for conditional inference trees and conditional random forests. If conditions are considered, the importance of each explanatory variable is computed by permuting within a grid defined by the explanatory variables that are associated with the response variable. The resulting variable importance score is conditional in the same sense as beta coefficients in regression models, but it represents the effect of a variable in both main effects and interactions. We can see that all the tree-based models select the same top two important variables, product type and guaranteed benefit amount, while the order of the subsequent variables differs considerably between models. Interestingly, in contrast to the other methods, CART did not choose the age of the policyholder and the time to maturity. It is widely recognised that the CART algorithm has a bias in variable selection, and this appears in our dataset since we have a product type variable with many levels. By contrast, the conditional inference tree-based models have a similar ranking of variable importance to the bagged trees and gradient boosted regression trees, even though their mechanisms for selecting the split variable are different.
As in the previous discussion, we note that the product type, which has several categories, is deemed the most important variable in all tree-based models. Figure 7 shows the variable importance of the dummified categorical variables for all five tree-based models. The first three are based on regression trees generated by the traditional CART algorithm; the last two are based on unbiased conditional inference trees. We observe a similar result: there exists a bias in variable selection in the traditional framework when compared to the conditional inference framework. For example, ABRU, MBRU, and IBRU are the most important product types, and gradient boosting is able to select these same variables.
4.5. Model performance visualisation
Figure 8 shows the lift curves, which are plotted using the ordered predictions of the models and the benchmark fair market values calculated by Monte Carlo simulation, according to ranks. The lift curve is drawn by binning the test sample of 190,000 VA contracts into 100 segments and taking the average of the ordered predictions and benchmark fair market values in each segment; the average points are then connected. If the curve for the benchmark fair market values closely follows the prediction curve, we can say that the model prediction is sound. Examining the lift curve plots for the various models, we observe that bagged trees, gradient boosting, and conditional random forests appear to give the most reliable predictions.
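A sketch of the binning behind Figure 8, with hypothetical vectors `pred` (model predictions) and `fmv_mc` (Monte Carlo benchmarks) over the 190,000 contracts:

```r
ord  <- order(pred)                                        # rank contracts by prediction
bins <- cut(seq_along(ord), breaks = 100, labels = FALSE)  # 100 equal segments

lift_pred <- tapply(pred[ord],   bins, mean)   # average prediction per segment
lift_mc   <- tapply(fmv_mc[ord], bins, mean)   # average benchmark per segment

plot(lift_pred, type = "l"); lines(lift_mc, lty = 2)
```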
To further compare the performance of the various tree-based models, Figure 9 shows a scatter plot of the fair market values calculated by Monte Carlo simulation against those predicted by the metamodels. Beside the scatter plot, we also provide density plots of both the predicted values and those calculated by Monte Carlo simulation. Not surprisingly, we see that single-tree models provide only a few constant predictions, whereas the ensemble tree-based models provide smoother predictions. Overall, gradient boosting provides the best prediction performance.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220329105624827-0132:S1748499521000075:S1748499521000075_fig9.png?pub-status=live)
Figure 9. Prediction and actual fair market values.
5. Concluding Remarks
This paper explores tree-based models and their extensions in developing metamodels for predicting the fair market values of the guarantees embedded in VA contracts. Tree-based models are increasingly widely used in various disciplines such as psychology, ecology, biology, economics, and insurance. As demonstrated in this paper, besides computational efficiency and predictive accuracy, they have many advantages as an alternative predictive tool. First, unlike ordinary kriging and the GB2 framework, tree-based models are non-parametric models that do not require distributional assumptions. Second, tree-based models can perform variable selection by assessing relative importance; such variable selection is often essential in actuarial science for purposes such as risk classification and the collection of risk variables. This paper further showed how to refine the variable selection of a categorical variable with several categories. Third, tree-based models, especially single smaller-sized trees, are straightforward to interpret through visualisation of the tree structure; this was illustrated for both the regression tree and the conditional inference tree. Finally, compared to other metamodels used for prediction, tree-based models require less data preparation, as they preserve the original scale of the variables and are thus more interpretable.
To empirically demonstrate the use of tree-based models as a predictive tool, this paper uses the synthetic dataset of 190,000 VA contracts described in Gan & Valdez (2017b). The objective is to examine the performance of tree-based models in the valuation of the fair market values of the product guarantees for large portfolios of VA contracts. In doing so, we compare the predictive accuracy of five tree-based models: regression tree (CART), bagged trees, gradient boosting, conditional inference trees, and conditional random forests, as well as two parametric metamodels: ordinary kriging and GB2 regression. For model comparison, we use a training dataset based on a small representative sample selected by a clustering algorithm, and for computing model performance, we use the entire dataset as the test dataset. Broadly speaking, tree-based models outperform the parametric metamodels considered in this paper and are generally very efficient at producing accurate predictions. While in several respects the gradient boosting ensemble method is the best amongst all models considered, this paper also demonstrated the many good qualities of conditional inference trees. Such tree-based models can be used for other types of insurance and actuarial applications, especially for datasets that contain categorical explanatory variables with several levels.
Acknowledgements
We thank the Society of Actuaries for the funding support provided in the completion of this research project through our Centers of Actuarial Excellence (CAE) grant on data mining.
A. Other Hyperparameter Optimisation Results
If the training dataset is significantly large, it is not ideal to perform a grid search; instead, random search or Bayesian optimisation is preferred. In this appendix, we provide a comparison of these two hyperparameter tuning methods.
Table A.1. Random search and Bayesian optimisation
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220329105624827-0132:S1748499521000075:S1748499521000075_tabA1.png?pub-status=live)
For the random search, we use the caret package to perform 50 random searches of the hyperparameter space for all 5 tree-based models. We select the model with the minimum ten-fold cross-validation root mean squared error (RMSE) and then find the prediction accuracy of each tree-based model on the whole dataset. Since the random search can be parallelised, we use six nodes with the following computer configuration (Operating system: Microsoft Windows 10 Education, 64-bit; Processor: Intel Core i7-7700K at 4.2 GHz, turbo up to 4.5 GHz, 8 MB cache, four cores; Memory: 32 GB).
For the Bayesian optimisation, the GP regression model applies a radial basis kernel to the covariance function of the multivariate normal prior. The GP regression model needs an initial set of hyperparameters and their model performance, and it makes predictions over the bounding box of the hyperparameter space. In our setting, there are ten initial sets of hyperparameters, from which we obtain the estimated mean and variance of the prediction accuracy (RMSE). Finally, we perform 50 searches using Bayesian optimisation with the caret and rBayesianOptimization packages. Again, to be fair in comparison, we select the model with the minimum ten-fold cross-validation RMSE and then find the prediction accuracy of each tree-based model on the whole dataset. Bayesian optimisation is performed without parallelism.
Table A.1 provides details of the training mean squared error on the 680 contracts (TrainMSE), the testing mean squared error on the whole portfolio (TestMSE), and the computational expense (Time), which includes training time (hyperparameter tuning) and prediction time. We can see that Bayesian optimisation is slower than random search, but it can provide better hyperparameter settings in complex situations such as gradient boosting with four hyperparameters. We should point out that the results shown in Table A.1 are a simple illustration and comparison of using random search and Bayesian optimisation to optimise the hyperparameters of various tree-based models. The prediction results should not be compared directly with the grid search results shown in Table 5, since the dimensions of the grids are not comparable.
In practice, for a large-scale dataset, we can run a few random searches and feed the results into the initial setting of the Bayesian optimisation. After performing the Bayesian optimisation, we have a better understanding of the dataset and can perform a grid search on the hyperparameter space narrowed down by the previous experiments.
B. Modelling Partial Greeks of Variable Annuities
The synthetic dataset created in Gan & Valdez (2017b) also contains the partial dollar deltas on five market indices. We apply the same techniques described in Section 4 to model these five partial dollar deltas, which were studied by Gan & Lin (2017) and Gan & Valdez (2017a).
Tables B.1, B.2, and B.3 show the prediction accuracy of the tree-based models and the traditional approaches for the five dollar deltas on the 190,000 VA contracts. The best performance figures are shown in boldface. From these tables, we observe that gradient boosting has consistently good performance in terms of $R^2$, although bagged trees perform better than gradient boosting for Delta2. From Tables B.2 and B.3, it is more challenging to judge the overall best model. Nevertheless, tree-based models have good prediction performance on the Greeks of VA portfolios with respect to the ME and PE measures.
Table B.1. Prediction accuracy of deltas, based on $R^2$
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220329105624827-0132:S1748499521000075:S1748499521000075_tabB1.png?pub-status=live)
Table B.2. Prediction accuracy of deltas, based on ME
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220329105624827-0132:S1748499521000075:S1748499521000075_tabB2.png?pub-status=live)
Table B.3. Prediction accuracy of deltas, based on PE
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20220329105624827-0132:S1748499521000075:S1748499521000075_tabB3.png?pub-status=live)