
LRMoE.jl: a software package for insurance loss modelling using mixture of experts regression model

Published online by Cambridge University Press:  18 March 2021

Spark C. Tseung
Affiliation:
Department of Statistical Sciences, University of Toronto, 100 St George Street, Toronto, ON M5S 3G3, Canada
Andrei L. Badescu
Affiliation:
Department of Statistical Sciences, University of Toronto, 100 St George Street, Toronto, ON M5S 3G3, Canada
Tsz Chai Fung
Affiliation:
RiskLab, Department of Mathematics, ETH Zurich, Zurich 8092, Switzerland; Department of Risk Management and Insurance, Georgia State University, Atlanta, GA 30303, USA
X. Sheldon Lin*
Affiliation:
Department of Statistical Sciences, University of Toronto, 100 St George Street, Toronto, ON M5S 3G3, Canada
*Corresponding author. E-mail: sheldon@utstat.utoronto.ca

Abstract

This paper introduces a new julia package, LRMoE, statistical software tailor-made for actuarial applications, which allows actuarial researchers and practitioners to model and analyse insurance loss frequencies and severities using the Logit-weighted Reduced Mixture-of-Experts (LRMoE) model. LRMoE offers several distinctive new features which are motivated by various actuarial applications and mostly cannot be achieved using existing packages for mixture models. Key features include wider coverage of frequency and severity distributions and their zero inflation, the flexibility to vary classes of distributions across components, parameter estimation under data censoring and truncation, and a collection of insurance ratemaking and reserving functions. The package also provides several model evaluation and visualisation functions to help users easily analyse the performance of the fitted model and interpret the model in insurance contexts.

Type
Paper
Copyright
© The Author(s), 2021. Published by Cambridge University Press on behalf of Institute and Faculty of Actuaries

1. Introduction

The Logit-weighted Reduced Mixture-of-Experts (LRMoE) is a flexible regression model introduced by Fung et al. (2019b), which can be regarded as a regression version of the finite mixture model in which the mixing weights (called the gating function) depend on the covariates. We may interpret the LRMoE as clustering policyholders into different subgroups with varying probabilities. Conditional on the subgroup component to which each policyholder is assigned, the distributional properties of loss frequency or severity are governed by the mixture component functions (called the expert functions). Model flexibility, parsimony and mathematical tractability are justified in Fung et al. (2019b), demonstrating a sound theoretical foundation of the LRMoE from a general insurance loss modelling perspective. Considering some specific choices of expert functions, Fung et al. (2019a) and Fung et al. (forthcoming) construct Expectation Conditional Maximisation (ECM) algorithms for efficient frequency and severity model calibration and show the potential usefulness of the LRMoE for insurance ratemaking and reserving.

While the existing R package flexmix (Leisch, 2004; Grün & Leisch, 2008) may perform parameter estimation for some special cases of LRMoE, it offers only limited choices of component functions (Poisson, Gaussian, Gamma and Binomial) for model fitting. Miljkovic & Grün (2016) have used its extensibility feature to prototype new mixture models with alternative component functions (such as Lognormal, Weibull and Burr), but users are still constrained to choosing a single parametric distribution for all the components.

This paper introduces a new package in the julia language, LRMoE, statistical software tailor-made for actuarial applications, which allows actuarial researchers and practitioners to model and analyse insurance loss frequencies and severities using the LRMoE model. The package offers several distinctive new features, which are motivated by various actuarial applications and mostly cannot be achieved by existing packages, including:

  1. Fast fitting: With the increasing need to analyse large insurance datasets with hundreds of thousands of observations, it is crucial that a statistical package can fit models within a reasonable time frame. Compared with traditional languages such as R, our implementation in julia significantly shortens the runtime, allowing users to obtain and analyse results much faster (see section 4.1 for comparison).

  2. Wider coverage of frequency and severity distributions: Apart from the severity distributions covered by Miljkovic & Grün (2016), the package also covers more frequency expert functions important for actuarial loss modelling, including the negative binomial and gamma-count distributions.

  3. Zero-inflated distributions: Often, actuaries are more interested in analysing the aggregate loss for each policyholder than in considering frequency and severity separately. In this situation, it is common in practice to observe excessive zeroes, which motivates the use of zero-inflated expert functions in the LRMoE, as offered in this package. Note that efficient computation for the zero-inflated LRMoE requires defining an additional latent variable (see section 2.2 for details), further hindering the effectiveness of using the extensibility feature of flexmix.

  4. Package extensibility: In addition to providing a wide coverage of distributions, our package also allows users to define their customised expert functions with simple guidance from the package documentation. Hence, the package can be used not only within the actuarial community, but also in a wider range of research and practical problems.

  5. Varying classes of distributions across components: Insurance loss data may exhibit a mismatch between body and tail behaviours, which should be captured using different distributions. One approach is to choose two distributions and combine them using a peaks-over-threshold method (see, e.g. Lee et al., 2012 and Scollnik & Sun, 2012). Another is to consider a finite mixture model based on different component distributions (see, e.g. Blostein & Miljkovic, 2019). The LRMoE package is similar to the latter: users can select different expert functions across different mixture components, which allows for more flexible and realistic modelling of data.

  6. Incomplete data: In many actuarial applications, including reinsurance, operational risk management, deductible ratemaking and loss reserving, censored and truncated data are often observed and need to be dealt with. Censoring and truncation for the LRMoE were introduced by Fung et al. (2020a), with the expert functions restricted to the univariate gamma distribution. The new package removes this restriction by offering users the versatility to fit randomly censored and truncated multivariate data with many choices of expert functions.

  7. Model selection and visualisation: In addition to the model fitting function, the new package also provides several model evaluation (AIC, BIC) and visualisation (e.g. latent class probabilities, covariate influence) functions to help users easily analyse the performance of the fitted model and interpret it in the insurance context.

  8. Insurance ratemaking and reserve calculation: The package further contains a number of pricing and reserving functions (e.g. mean, variance, value-at-risk (VaR), conditional tail expectation (CTE)), which enable actuaries to simultaneously perform ratemaking for multiple insurance contracts with different characteristics, based on an abundant choice of premium principles.

The paper is organised as follows. Section 2 reviews the LRMoE model and parameter estimation using the ECM algorithm. In section 3, we use a simulated dataset to demonstrate the basic fitting procedure in the LRMoE package. Section 4 presents further package utilities, such as parameter initialisation, model visualisation and pricing functions, which are illustrated using a French auto insurance dataset. The paper concludes with some remarks in section 5. For brevity, we only present the code lines most relevant to our new package. The source code, package documentation and complete replication code for all examples in this paper are available at https://github.com/sparktseung/LRMoE.jl and https://sparktseung.github.io/LRMoE.jl/dev/. Further, we have developed a corresponding R package (accelerated by Rcpp) with similar functionality for users interested in running the package in R instead. We refer such readers to Tseung et al. (2020) for the vignette, and to https://github.com/sparktseung/LRMoE for the code and documentation.

2. LRMoE Model and Parameter Estimation

In this section, we provide a brief overview of the LRMoE model proposed in Fung et al. (2019b), and discuss the ECM algorithm for parameter estimation. For brevity of presentation, we will assume, in sections 2.1 and 2.2, that all response variables (claim frequency or severity) are observed exactly. In section 2.3, we will address data truncation and censoring for the LRMoE model.

2.1. Logit-weighted reduced mixture of experts

Let ${\textit{\textbf{x}}}_{i} = (x_{i0}, x_{i1}, \dots, x_{iP})^{T}$ denote the $(P+1)$-dimensional covariate vector for policyholder i ($i=1, 2, \dots, n$) with intercept $x_{i0} = 1$, and ${\textit{\textbf{y}}}_{i} = (y_{i1}, y_{i2}, \dots, y_{iD})^{T}$ denote the D-dimensional vector of their response variables, which can be either claim frequency or severity. Let ${\textit{\textbf{X}}} = ({\textit{\textbf{x}}}_{1}, {\textit{\textbf{x}}}_{2}, \dots, {\textit{\textbf{x}}}_{n})^{T}$ and ${\textit{\textbf{Y}}} = ({\textit{\textbf{y}}}_{1}, {\textit{\textbf{y}}}_{2}, \dots, {\textit{\textbf{y}}}_{n})^{T}$ denote all covariates and responses for a group of n policyholders.

Based on the covariates, policyholder i is classified into one of the g latent risk classes by a logit gating function

(1) \begin{equation} \pi_{j}({\textit{\textbf{x}}}_{i}; {\boldsymbol{\alpha}}_{j}) = \frac{\exp\left({\boldsymbol{\alpha}}^{T}_{j}{\textit{\textbf{x}}}_{i}\right)}{\sum_{j^{\prime}=1}^{g} \exp\left({\boldsymbol{\alpha}}^{T}_{j^{\prime}}{\textit{\textbf{x}}}_{i}\right) }, \quad j = 1, 2, \dots, g\end{equation}

where ${\boldsymbol{\alpha}}_{j} = (\alpha_{j0}, \alpha_{j1}, \dots, \alpha_{jP})^{T}$ is a vector of regression coefficients for latent class j.
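As a minimal standalone illustration (not part of the package API), the gating weights in equation (1) are a softmax of the linear scores and can be computed with the usual max-subtraction for numerical stability:

```julia
# Sketch of the logit gating function in equation (1).
# `alpha` is the g × (P+1) coefficient matrix; `x` is the (P+1)-vector
# of covariates with x[1] = 1 for the intercept.
function gating_weights(x::AbstractVector, alpha::AbstractMatrix)
    s = alpha * x              # g linear scores α_j' x
    e = exp.(s .- maximum(s))  # subtract the max for numerical stability
    return e ./ sum(e)         # (π_1(x), ..., π_g(x))
end
```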

Given the assignment of latent class $j \in \{ 1, 2, \dots, g \}$, the response variables ${\textit{\textbf{y}}}_{i} = (y_{i1}, y_{i2}, \dots, y_{iD})^{T}$ are conditionally independent. For $d = 1, \dots, D$ and $i=1,\ldots,n$, the dth dimension probability density function (or probability mass function) of $y_{id}$ is given by

(2) \begin{equation} g_{jd}(y_{id}; \delta_{jd}, {\boldsymbol{\psi}}_{jd}) = \delta_{jd}\boldsymbol{1}\{ y_{id}=0\} + (1-\delta_{jd})f_{jd}(y_{id}; {\boldsymbol{\psi}}_{jd})\end{equation}

where $\delta_{jd}\in [0,1]$ is a parameter representing the probability of assigning an observation into a zero-inflated component (i.e. a probability mass at zero), $\boldsymbol{1}\{ y_{id}=0\}$ is the indicator function, and $f_{jd}(y_{id}; {\boldsymbol{\psi}}_{jd})$ is the density of some commonly used distribution with parameters ${\boldsymbol{\psi}}_{jd}$ for modelling insurance losses. Table 1 gives a list of parametric distributions supported by the LRMoE package. Note that a probability mass $\delta_{jd}$ at zero allows for more realistic modelling of insurance data which usually exhibit excess zeros. As a naming convention, we will refer to $g_{jd}$ as a component distribution/expert function, and $f_{jd}$ as its positive part, although claim frequency distributions (e.g. Poisson) also have some probability mass at zero.

Table 1. Distributions supported by LRMoE.

Under LRMoE, the conditional probability density function (or probability mass function) of ${\textit{\textbf{y}}}_i$ given covariates ${\textit{\textbf{x}}}_i$ is

(3) \begin{equation} h({\textit{\textbf{y}}}_i; {\textit{\textbf{x}}}_i, {\boldsymbol{\alpha}}, {\boldsymbol{\delta}}, {\boldsymbol{\Psi}}) = \sum_{j=1}^{g} \left[ \pi_{j}({\textit{\textbf{x}}}_{i}; {\boldsymbol{\alpha}}_{j}) \times \prod_{d=1}^{D} g_{jd}(y_{id}; \delta_{jd}, {\boldsymbol{\psi}}_{jd}) \right]\end{equation}

where ${\boldsymbol{\alpha}} = ({\boldsymbol{\alpha}}_{1}, {\boldsymbol{\alpha}}_{2}, \dots, {\boldsymbol{\alpha}}_{g})^{T}$ is a $g \times (P+1)$-matrix of mixing weights with ${\boldsymbol{\alpha}}_{g} = (0, 0, \dots, 0)^{T}$ representing the default class, ${\boldsymbol{\delta}} = (\delta_{jd})_{1 \leq j \leq g, 1 \leq d \leq D}$ is a $(g \times D)$-matrix of probability masses at zero by component and by dimension, and ${\boldsymbol{\Psi}} = \{ {\boldsymbol{\psi}}_{jd}: 1 \leq j \leq g, 1 \leq d \leq D \}$ is a list of parameters for the positive part $f_{jd}$ by component and by dimension. Note that ${\boldsymbol{\alpha}}_{g} = (0, 0, \dots, 0)^{T}$ ensures that the model is identifiable, i.e. there is a one-to-one mapping between the regression distributions and the parameters (see Jiang & Tanner, 1999 and Fung et al., 2019a). Note that the total number of parameters of the model is given by $(g-1)\times (P+1)+g\times D+\sum_{j=1}^g\sum_{d=1}^{D}\#\{{\boldsymbol{\psi}}_{jd}\}$, where $\#\{{\boldsymbol{\psi}}_{jd}\}$ is the length of ${\boldsymbol{\psi}}_{jd}$.
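As a worked instance of this count (based on the demo model of section 3, assuming its five covariates include the intercept), a two-component bivariate model with Poisson (one parameter), zero-inflated gamma-count (two parameters beyond $\delta$), lognormal (two) and inverse Gaussian (two) experts has $(2-1)\times 5 + 2\times 2 + (1+2+2+2) = 16$ parameters.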

For a group of n policyholders, the likelihood function is given by

(4) \begin{equation} L({\boldsymbol{\alpha}}, {\boldsymbol{\delta}}, {\boldsymbol{\Psi}}; {\textit{\textbf{X}}}, {\textit{\textbf{Y}}}) = \prod_{i=1}^{n} h({\textit{\textbf{y}}}_i; {\textit{\textbf{x}}}_i, {\boldsymbol{\alpha}}, {\boldsymbol{\delta}}, {\boldsymbol{\Psi}}) = \prod_{i=1}^{n}\left\{ \sum_{j=1}^{g} \left[ \pi_{j}({\textit{\textbf{x}}}_{i}; {\boldsymbol{\alpha}}_{j}) \times \prod_{d=1}^{D} g_{jd}(y_{id}; \delta_{jd}, {\boldsymbol{\psi}}_{jd}) \right] \right\}\end{equation}

2.2. Parameter estimation

For parameter estimation in finite mixture models, the Expectation–Maximisation (EM) algorithm is commonly used (see, e.g. Dempster et al., 1977 and McLachlan & Peel, 2004). However, for the LRMoE model, the M-step requires maximisation of a non-concave function over all elements of ${\boldsymbol{\alpha}}$. Fung et al. (2019a) thus use the ECM algorithm (Meng & Rubin, 1993), which breaks the M-step into several substeps. The ECM algorithm implemented in the LRMoE package is described as follows.

Denote ${\boldsymbol{\Phi}} = ({\boldsymbol{\alpha}}, {\boldsymbol{\delta}}, {\boldsymbol{\Psi}})$ as the parameters to estimate. For $i = 1, 2, \dots, n$, we introduce a latent random vector ${\textit{\textbf{Z}}}_{i} = (Z_{i1}, Z_{i2}, \dots, Z_{ig})^{T}$, where $Z_{ij} = 1$ if ${\textit{\textbf{y}}}_{i}$ comes from the jth component distribution and $Z_{ij} = 0$ otherwise. For $d = 1, 2, \dots, D$, we further write $Z_{ij} = Z_{ijd0} + Z_{ijd1}$, where $Z_{ijd0}= 0$ and $Z_{ijd1}= 1$ if the dth dimension of ${\textit{\textbf{y}}}_{i}$ comes from the positive part $f_{jd}$ of the jth component, and $Z_{ijd0}= 1$ and $Z_{ijd1}= 0$ if it comes from the zero inflation $\delta_{jd}$.

The complete-data log-likelihood function is given by

(5) \begin{equation}\begin{aligned} l^{\textrm{com}}\left({\boldsymbol{\Phi}}; {\textit{\textbf{X}}}, {\textit{\textbf{Y}}}, {\textit{\textbf{Z}}}\right) & = \sum_{i=1}^{n}\sum_{j=1}^{g} Z_{ij} \left\{ \log \pi_{j}\left({\textit{\textbf{x}}}_{i}; {\boldsymbol{\alpha}}_{j}\right) + \sum_{d=1}^{D}\log g_{jd}(y_{id}; \delta_{jd}, {\boldsymbol{\psi}}_{jd}) \right\} \\ & = \sum_{i=1}^{n}\sum_{j=1}^{g} Z_{ij} \log \pi_{j}({\textit{\textbf{x}}}_{i}; {\boldsymbol{\alpha}}_{j}) + \sum_{i=1}^{n}\sum_{j=1}^{g}\sum_{d=1}^{D} \left\{ Z_{ijd0}\log \delta_{jd} + Z_{ijd1}\log (1- \delta_{jd}) \right\} \\ & + \sum_{i=1}^{n}\sum_{j=1}^{g}\sum_{d=1}^{D} Z_{ijd1}\log f_{jd}\left(y_{id}; {\boldsymbol{\psi}}_{jd}\right)\end{aligned}\end{equation}

E-Step:

For each $i=1, 2, \dots, n$, the random vector ${\textit{\textbf{Z}}}_{i}$ follows a multinomial distribution with count 1 and probabilities $(\pi_{1}({\textit{\textbf{x}}}_{i}; {\boldsymbol{\alpha}}_{1}), \pi_{2}({\textit{\textbf{x}}}_{i}; {\boldsymbol{\alpha}}_{2}), \dots, \pi_{g}({\textit{\textbf{x}}}_{i}; {\boldsymbol{\alpha}}_{g}))$. Given $Z_{ij}=1$, the conditional distribution of $Z_{ijd0}$ is Bernoulli with probability $\delta_{jd}$. Hence, at the tth iteration, the posterior expectations of $Z_{ij}$, $Z_{ijd0}$ and $Z_{ijd1}$ are

(6) \begin{equation}\begin{aligned} z^{(t)}_{ij} & = \texttt{E}\{ Z_{ij} | {\boldsymbol{\Phi}}^{(t-1)}, {\textit{\textbf{X}}}, {\textit{\textbf{Y}}} \} = \texttt{P}\{ Z_{ij} = 1 | {\boldsymbol{\Phi}}^{(t-1)}, {\textit{\textbf{X}}}, {\textit{\textbf{Y}}} \} \\ & = \frac{ \pi_{j}\left({\textit{\textbf{x}}}_{i}; {\boldsymbol{\alpha}}_{j}^{(t-1)}\right) \times \prod_{d=1}^{D} g_{jd}\left(y_{id}; \delta_{jd}^{(t-1)}, {\boldsymbol{\psi}}_{jd}^{(t-1)}\right) }{\sum_{j^{\prime}=1}^{g} \pi_{j^{\prime}}\left({\textit{\textbf{x}}}_{i}; {\boldsymbol{\alpha}}_{j^{\prime}}^{(t-1)}\right) \times \prod_{d=1}^{D} g_{j^{\prime}d}\left(y_{id}; \delta_{j^{\prime}d}^{(t-1)}, {\boldsymbol{\psi}}_{j^{\prime}d}^{(t-1)}\right)},\end{aligned}\end{equation}
(7) \begin{equation}\begin{aligned} z^{(t)}_{ijd0} & = \texttt{E}\{ Z_{ijd0} | {\boldsymbol{\Phi}}^{(t-1)}, {\textit{\textbf{X}}}, {\textit{\textbf{Y}}} \} \\ & = \texttt{P}\{ Z_{ijd0} = 1 | {\boldsymbol{\Phi}}^{(t-1)}, {\textit{\textbf{X}}}, {\textit{\textbf{Y}}}, Z_{ij} = 1 \} \times \texttt{P}\{ Z_{ij} = 1 | {\boldsymbol{\Phi}}^{(t-1)}, {\textit{\textbf{X}}}, {\textit{\textbf{Y}}} \} \\ & = \frac{\delta_{jd}^{(t-1)} \boldsymbol{1}\{y_{id}=0\} }{\delta_{jd}^{(t-1)} \boldsymbol{1}\{y_{id}=0\} + \left(1-\delta_{jd}^{(t-1)}\right)F_{jd}\left(0; {\boldsymbol{\psi}}_{jd}^{(t-1)}\right)} \times z^{(t)}_{ij}\end{aligned}\end{equation}

and

(8) \begin{equation}\begin{aligned} z^{(t)}_{ijd1} & = z^{(t)}_{ij} - z^{(t)}_{ijd0}\end{aligned}\end{equation}

where $F_{jd}$ is the cumulative distribution function of the positive part $f_{jd}$.

CM-Step:

In the CM-step, we aim to maximise $Q({\boldsymbol{\Phi}}; {\boldsymbol{\Phi}}^{(t-1)}, {\textit{\textbf{X}}}, {\textit{\textbf{Y}}})$, which can be decomposed into three parts as

(9) \begin{equation} Q({\boldsymbol{\Phi}}; {\boldsymbol{\Phi}}^{(t-1)}, {\textit{\textbf{X}}}, {\textit{\textbf{Y}}}) = Q^{(t)}_{{\boldsymbol{\alpha}}} + Q^{(t)}_{{\boldsymbol{\delta}}} + Q^{(t)}_{{\boldsymbol{\Psi}}}\end{equation}

where

(10) \begin{equation} Q^{(t)}_{{\boldsymbol{\alpha}}} = \sum_{i=1}^{n}\sum_{j=1}^{g} z^{(t)}_{ij} \log \pi_{j}({\textit{\textbf{x}}}_{i}; {\boldsymbol{\alpha}}_{j}),\end{equation}
(11) \begin{equation} Q^{(t)}_{{\boldsymbol{\delta}}} = \sum_{i=1}^{n}\sum_{j=1}^{g}\sum_{d=1}^{D} \left\{ z^{(t)}_{ijd0}\log \delta_{jd} + z^{(t)}_{ijd1}\log (1- \delta_{jd}) \right\}\end{equation}

and

(12) \begin{equation} Q^{(t)}_{{\boldsymbol{\Psi}}} = \sum_{i=1}^{n}\sum_{j=1}^{g}\sum_{d=1}^{D} z^{(t)}_{ijd1}\log f_{jd}\left(y_{id}; {\boldsymbol{\psi}}_{jd}\right)\end{equation}

To maximise $Q^{(t)}_{{\boldsymbol{\alpha}}}$, we use the same conditional maximisation as described in Fung et al. (2019a). We first maximise it with respect to ${\boldsymbol{\alpha}}_{1}$ with ${\boldsymbol{\alpha}}_{j}$ fixed at ${\boldsymbol{\alpha}}_{j}^{(t-1)}$ for $j = 2, 3, \dots, g-1$. The next step is to maximise with respect to ${\boldsymbol{\alpha}}_{2}$ with the updated ${\boldsymbol{\alpha}}_{1}^{(t)}$ and the other ${\boldsymbol{\alpha}}_{j}$ fixed at ${\boldsymbol{\alpha}}_{j}^{(t-1)}$ for $j = 3, 4, \dots, g-1$. The process continues until all ${\boldsymbol{\alpha}}$’s have been updated. To obtain each ${\boldsymbol{\alpha}}_{j}^{(t)}$, the Iteratively Reweighted Least Squares (IRLS) approach (Jordan & Jacobs, 1994) is used until convergence.

For $Q^{(t)}_{{\boldsymbol{\delta}}}$, each $\delta_{jd}$ can be updated using the following closed-form solution

(13) \begin{equation} \delta_{jd}^{(t)} = \frac{\sum_{i=1}^{n} z^{(t)}_{ijd0}}{\sum_{i=1}^{n} \left(z^{(t)}_{ijd0} + z^{(t)}_{ijd1}\right)}\end{equation}
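A minimal sketch of this update, assuming the posterior expectations $z^{(t)}_{ijd0}$ and $z^{(t)}_{ijd1}$ are stored as $n \times g \times D$ arrays (the names z0 and z1 are illustrative, not the package's internals):

```julia
# z0, z1: n × g × D arrays holding z_ijd0 and z_ijd1 from the E-step.
# Summing over observations (dims = 1) yields the g × D matrix of updated δ's,
# as in equation (13).
delta_new = dropdims(sum(z0; dims=1) ./ sum(z0 .+ z1; dims=1); dims=1)
```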

The maximisation of $Q^{(t)}_{{\boldsymbol{\Psi}}}$ can also be divided into smaller problems by component j and by dimension d. For each $j=1,\ldots,g$ and $d=1,\ldots,D$, the problem reduces to maximising a weighted log-likelihood of an expert function, where the weight of each observation is given by $z^{(t)}_{ijd1}$. For updating each ${\boldsymbol{\psi}}^{(t)}_{jd}$, closed-form solutions are available only for very special distributions (e.g. Poisson, Lognormal). Numerical optimisation is used in most cases, especially when the observations ${\textit{\textbf{y}}}_{i}$ are not observed exactly (see section 2.3).

As discussed in McLachlan & Peel (2004), a mixture of severity distributions may have an unbounded likelihood, which leads to spurious models with extremely large or small parameter values. In the LRMoE package, we adopt the same maximum a posteriori (MAP) approach as in Fung et al. (forthcoming), which uses appropriate prior distributions to penalise the magnitude of the fitted parameters (see section 3). The rationale for including penalty functions is to avoid obtaining a spurious model due to the unbounded nature of the log-likelihood function. The penalty itself should be small enough that it has a negligible impact on the fitted model; on the other hand, it prevents parameters from diverging to unreasonable values, making the fitted model more robust. For more details regarding the rationale and execution, readers are referred to section 4 of Fung et al. (forthcoming).

2.3. LRMoE with censoring and truncation

Censoring and truncation are common in insurance data sets and need to be dealt with. For example, when a policy limit is applied, loss amounts above the limit will be recorded as the limit only, which creates right censoring of the complete loss data; when a policy deductible is applied, loss amounts below the deductible are not reported to the insurer, thus leading to left truncation.

Fung et al. (2020a) have discussed the LRMoE model with censoring and truncation, where all component distributions are Gamma. For parameter estimation with data censoring and truncation, the ECM algorithm in section 2.2 is slightly modified, with an additional E-step to remove the uncertainties arising from censoring and truncation. Since the main purpose of this paper is to demonstrate the application of the LRMoE package, we omit the details and refer interested readers to the cited paper.

The LRMoE package can perform parameter estimation in the presence of data truncation and censoring for all distributions it includes. Consequently, the user's input is slightly different from that of existing packages for mixture models. A detailed example of model fitting in our package is given in section 3.

2.4. Ratemaking and reserving in LRMoE

The model structure of the LRMoE allows for easy computation of quantities relevant to actuarial ratemaking and reserving (see Fung et al., 2019b). At the policyholder level, the moments and common measures of dependence (e.g. Kendall's tau and Spearman's rho) of ${\textit{\textbf{y}}}_i$ can be computed in simple forms. The value-at-risk (VaR) and conditional tail expectation (CTE) can also be obtained numerically without much difficulty. Various premium principles can be applied to price insurance contracts, including the pure premium, standard deviation (SD) premium, limited expected value (LEV) and stop-loss (SL) premium. Risk measures can also be calculated for each individual policyholder (e.g. the 99% VaR). At the portfolio level, simulation can be conducted to obtain the distribution of the aggregate loss of all policyholders, which is useful for calculating the total loss reserve and portfolio-level premiums. The simulation process is facilitated by a data simulator included in our package (see section 4.4).

3. Example: Simulated Dataset

In this section, we will demonstrate how to fit an LRMoE model using a simulated dataset which accompanies the package. The variables of this simulated dataset are described in Table 2, and the dataset is generated by an LRMoE model given in Table 3. The package and dataset can be installed and loaded as follows.
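A minimal sketch of the installation and loading steps, assuming direct installation from the GitHub repository given in section 1 (the registered package name and the demo-data loading call may differ; see the package documentation):

```julia
# Install LRMoE.jl from its GitHub repository and load it.
using Pkg
Pkg.add(url="https://github.com/sparktseung/LRMoE.jl")
using LRMoE
# The demo dataset accompanies the package; see the documentation for loading it.
```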

To address data truncation and censoring, the user's input of the response ${\textit{\textbf{Y}}}$ differs from that of existing packages. For each dimension d of observation ${\textit{\textbf{y}}}_{i}$, instead of a single numeric input, a quadruple $0 \leq t_{id}^{l} \leq y_{id}^{l} \leq y_{id}^{u} \leq t_{id}^{u} \leq \infty$ is needed, where $t_{id}^{l}$ and $t_{id}^{u}$ are the lower and upper truncation bounds, and $y_{id}^{l}$ and $y_{id}^{u}$ are the lower and upper censoring bounds. The exact value of $y_{id}$ lies between the censoring bounds, i.e. $y_{id}^{l} \leq y_{id} \leq y_{id}^{u}$. For a sample of size n, an $n \times (4D)$-matrix is needed, where each $n\times 4$-block describes one dimension of ${\textit{\textbf{Y}}}$.

Sample rows of $\texttt{Y\_obs}$ in the demo dataset are shown as follows. These rows of data illustrate three possible scenarios of data truncation or censoring. For the first row, both dimensions of ${\textit{\textbf{y}}}_i$ are observed exactly without truncation, so $t_{i1}^l = t_{i2}^l = 0$, $y_{i1}^l = y_{i1}^u= y_{i1}$, $y_{i2}^l = y_{i2}^u = y_{i2}$ and $t_{i1}^u = t_{i2}^u = \infty$. For the second row, the second dimension of ${\textit{\textbf{y}}}_i$ is truncated at 5 (e.g. by imposing a policy deductible) but not censored, so $t_{i2}^l = 5$. For the third row, the second dimension of ${\textit{\textbf{y}}}_i$ is right-censored at 100 (e.g. by applying a policy limit) and the exact value of $y_{i2}$ is unknown, so $y_{i2}^l = 100$ and $y_{i2}^u = \infty$.
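To make the quadruple format concrete, the sketch below constructs three rows matching these scenarios; values other than the stated bounds are made up for illustration:

```julia
# Columns per dimension d: t_d^l, y_d^l, y_d^u, t_d^u, so D = 2 gives 8 columns.
Y_obs = [0.0  6.0  6.0  Inf  0.0   20.0   20.0  Inf;   # row 1: exact, untruncated
         0.0  4.0  4.0  Inf  5.0   30.0   30.0  Inf;   # row 2: dim 2 left-truncated at 5
         0.0  8.0  8.0  Inf  0.0  100.0   Inf   Inf]   # row 3: dim 2 right-censored at 100
```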

To fit an LRMoE model, the user minimally needs to specify only the following: an initial guess of the logit regression coefficients (alpha_init), and the component distributions to use for each dimension and each component together with an initial guess of their parameters (comp_init).

For illustration purposes, we first assume that the user's choice of component distributions coincides with the true model, and that the initial guesses of parameters are also close to the true ones. The sample code below provides an example: with two components and five covariates, the initial guess alpha_init is a $2\times5$ matrix, where entries of zero indicate a non-informative guess. As for the component distributions, comp_init is a $2\times2$ matrix, where the number of rows (or columns) corresponds to the number of components (or the dimension of the response). Each entry of comp_init indicates a choice of expert function. In this case, $y_{i1}$ is a mixture of a Poisson with mean $\lambda = 10.0$ and a zero-inflated Gamma-count distribution with zero inflation $\delta = 0.50$, shape parameter $m = 40$ and dispersion parameter $s = 0.80$. Similarly, $y_{i2}$ is a mixture of $\textrm{Lognormal}(\mu = 3.0, \sigma = 1.0)$ and $\textrm{Inverse Gaussian}(\mu = 15.0, \lambda = 15.0)$.
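A sketch of this initialisation is given below. The expert constructor names (PoissonExpert, ZIGammaCountExpert, LogNormalExpert, InverseGaussianExpert) follow the package's naming pattern but are assumptions to be checked against the package documentation:

```julia
# Rows of comp_init index components; columns index dimensions of the response.
alpha_init = fill(0.0, 2, 5)  # 2 components × 5 covariates; zeros = non-informative
comp_init = [PoissonExpert(10.0)                  LogNormalExpert(3.0, 1.0);
             ZIGammaCountExpert(0.50, 40.0, 0.80) InverseGaussianExpert(15.0, 15.0)]
```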

Table 2. Description of demo dataset.

Note 1: All covariates are generated independently and uniformly at random.

Note 2: The complete dataset (X, Y) contains 10,000 rows. As a result of left-truncating Y[,2], 163 rows of data are discarded, and the observed dataset (X_obs, Y_obs) has only 9,837 rows.

Table 3. True model of demo dataset.

The fitting function of LRMoE can be called as follows; it returns a fitted model as well as the log-likelihood and the information criteria AIC and BIC. The result can be inspected by a summary() function, or by standard julia methods.
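A hypothetical fitting call is sketched below; the function name fit_LRMoE and its argument order are assumptions modelled on the package documentation rather than a verbatim excerpt:

```julia
# Fit the two-component LRMoE to the demo data and inspect the result.
model = fit_LRMoE(Y_obs, X_obs, alpha_init, comp_init)  # hypothetical name/signature
summary(model)  # log-likelihood, AIC, BIC and parameter estimates
```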

The fitted model is summarised in Table 4. The parameter estimates are quite close to the true ones. Considering the simulated random noise and the loss of information due to censoring and truncation, the fitting function is able to identify the true model when it is known.

Table 4. Fitted model 1 of demo dataset.

Table 5. Fitted model 2 of demo dataset.

In practice, when the true underlying model is not known, the user needs to perform some preliminary analysis on the dataset to determine model specification and parameter initialisation. Table 5 contains the parameter estimates of another user-specified LRMoE model for the demo dataset, which has quite different component distributions compared with the true model.

The fitted log-likelihood values for Model 1 and Model 2 are -73,147.35 and -74,491.24, respectively. A graphical comparison of the two models is given in Figure 1. While both models have similar fitting performance for claim severity, Model 2 is noticeably worse at fitting small and extreme values of the claim frequency.

Figure 1. Fitting results of DemoData.

Table 6. Description of French auto insurance data.

4. Example: Real Dataset

In this section, we illustrate how to fit an LRMoE model to a real insurance dataset. As the basic fitting procedure has been discussed in the previous section, we will focus on other utilities of our package, including parameter initialisation, model uncertainty, simulation, actuarial pricing functions and model visualisation.

4.1. Dataset and computation time

Throughout this section, we will use the French auto insurance claims datasets freMTPLfreq and freMTPLsev included in the R package CASdatasets (Dutang & Charpentier, 2019). Exploratory analysis and data cleaning procedures can be found on the accompanying website of our package. The cleaned dataset has 412,609 observations. The claim amount has high zero inflation, as less than 4% of policyholders have filed at least one claim. For positive claim amounts, the distribution is right-skewed, multimodal and heavy-tailed. The covariates used for modelling are described in Table 6.

Before proceeding, we remark on the computation time of our package. Compared with the demo dataset in section 3, analysing freMTPLfreq and freMTPLsev resembles a realistic actuarial modelling problem with many covariates and observations. In such cases, the julia programming language is significantly more advantageous than traditional statistical languages such as R. Some standard benchmarks can be found at https://julialang.org/benchmarks/. Our experience confirms julia's better performance: it takes around 20 hours in R (with optimisation in Rcpp) to fit a model in this section, while julia takes less than 5 hours to complete the same procedure. For the corresponding R package which we have developed, we refer readers to Tseung et al. (2020).

4.2. Parameter initialisation

Since the fitting procedure of the LRMoE involves multivariate optimisation, a good initialisation of parameters will often lead to faster convergence compared with a non-informative guess. Gui et al. (2018) have proposed an initialisation procedure for fitting mixtures of Erlang distributions, which involves k-means clustering and the clusterised method of moments (CMM). This has been used in Fung et al. (forthcoming) and offers reasonably good starting values of parameters.

Our package contains an initialisation function which applies the CMM method to all component distributions. Some preliminary analysis is needed to determine the number of clusters (components) to use. Since the positive part of all distributions included in our package is unimodal, a heuristic starting point is to examine the empirical histogram of data and count the number of peaks (see Figure 3).

As an example, the procedure to initialise a three-component LRMoE is given as follows. The user needs to input the response Y and the covariates X, as used in the fitting function. In addition, the third argument 3 corresponds to the number of components, while the last argument ["continuous"] indicates that the response Y is one-dimensional and continuous. For the demo dataset used in section 3, the input would be ["discrete", "continuous"].
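A sketch of the call, with the argument order as just described:

```julia
# Response, covariates, number of components, and the type of each response dimension.
init = cmm_init(Y, X, 3, ["continuous"])
```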

The cmm_init function will return a list of parameter initialisations. For the logit regression coefficients, we assume a non-informative guess of covariate influence, resulting in zero coefficients on all covariates. The user is free to incorporate prior knowledge by modifying them afterwards. The relative size of each cluster is reflected by the intercept terms. In the following initialisation, the sizes of the three latent clusters are proportional to $(e^{-1.41667}, e^{-2.8687}, e^{0.0})$, i.e. the proportions of these clusters within the entire dataset are given by $(0.3842, 0.0330, 0.5827)$.

As for the parameters of the component distributions, for each dimension of Y, we apply the CMM method to obtain initialisations for all types of expert functions. A summary of all initialisations is given in Table 7. The initialisation function cmm_init also returns summary statistics (mean, coefficient of variation, skewness and kurtosis), which help with choosing which combination of expert functions to use. Initialisations with extremely large or small parameter values could result in a spurious model or a bad fit, and should thus be avoided.

Table 7. Sample initialisation of parameters.

Note: Since the moments cannot be written in closed form, these parameters are solved for by maximising the log-likelihood based on the observations in each cluster.

The cmm_init function also returns two suggested models based on the highest log-likelihood (ll_best) and the best Kolmogorov–Smirnov test (ks_best). For example, the model initialisation with the highest log-likelihood is given by the following.

4.3. Fitting results and model selection

For illustration purposes, we fit only a selected number of LRMoEs to the entire French auto insurance dataset. The fitted log-likelihood, AIC and BIC are summarised in Table 8. In most cases, adding more components will increase the fitted log-likelihood (e.g. compare models iwl, will and wllil). The models selected by AIC and BIC have six and five components, respectively. In this particular setting, BIC heavily penalises models with more components, since the sample size is large and adding one component increases the number of parameters by roughly 30 (the number of logit regression coefficients).

Table 8. Fitting results of French auto insurance data.

Note 1: Component distributions are represented by their first letters, for example, l for Lognormal, b for Burr, etc. All components are zero-inflated. The number in brackets is the total number of parameters.

Note 2: Stopping criterion is $<0.05$ improvement in log-likelihood, or 500 iterations.

Note 3: For each g, log-likelihood values are in decreasing order. The overall optimal values are bolded.

Apart from AIC and BIC, cross-validation (CV) is an alternative model selection criterion which avoids overfitting with too many latent components. For example, Gui et al. (2018) consider a 10-fold CV for fitting mixtures of Erlangs, where the averaged log-likelihood on the test sets is used as a score function to select the optimal number of components. CV can be implemented with the help of parallel computing in julia (see also section 4.5).

4.4. Pricing and reserving functions

Our package contains a collection of functions related to actuarial pricing, reserving and risk management, including calculation of the mean, variance, VaR, CTE, LEV $E[(Y \wedge u)]$ and SL premium $E[(Y - d)_+]$ of the response variable. These functions start with the root predict_, followed by the appropriate quantity of interest (mean, var, VaR, CTE, limit, excess) and the corresponding function arguments.

As an illustration, we consider three policyholders in Table 9, where A has no claim history, B has a medium-sized claim and C has a large claim. Below is the sample code to calculate different quantities of interest.
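The sketch below illustrates the predict_ naming convention for these quantities; the exact signatures (argument order, quantile and limit arguments) are assumptions, and X_abc denotes the covariate rows of Policyholders A, B and C:

```julia
u, d = 1_000.0, 500.0  # illustrative policy limit and deductible
p_mean = predict_mean_prior(X_abc, alpha_fit, comp_fit)       # pure premium E[Y]
p_var  = predict_var_prior(X_abc, alpha_fit, comp_fit)        # variance of Y
p_VaR  = predict_VaR_prior(X_abc, alpha_fit, comp_fit, 0.99)  # 99% VaR
p_CTE  = predict_CTE_prior(X_abc, alpha_fit, comp_fit, 0.99)  # 99% CTE
p_LEV  = predict_limit_prior(X_abc, alpha_fit, comp_fit, u)   # LEV E[Y ∧ u]
p_SL   = predict_excess_prior(X_abc, alpha_fit, comp_fit, d)  # SL premium E[(Y − d)₊]
```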

Actuarial pricing calculations can be based either on individual loss distributions or on the aggregate loss distribution of the entire portfolio. Both can be achieved using built-in functions in our package.

Table 9. Selected policyholders in French auto insurance dataset.

On the individual level, the functions above can be called directly to calculate the pure premium E[Y], as well as the premium value in the presence of a policy deductible ($E[(Y - d)_+]$) or policy limit ($E[(Y \wedge u)]$). The first part of Table 10 summarises the premium calculation for Policyholders A, B and C, based on individual-level premium principles.

On the portfolio level, the sim_dataset function (see also section 4.6) will simulate one response value (i.e. claim amount) for each individual policyholder; these can be summed to give the aggregate portfolio loss under one possible scenario. Repeated simulation of the entire portfolio produces the empirical distribution of the aggregate loss. The VaR and CTE of the aggregate loss can be obtained from the simulated sample, which is useful for setting the insurer's total reserve. In addition, the VaR and CTE can be allocated back to policyholders as a loaded premium, according to some weighting scheme which reflects policyholders' relative riskiness (e.g. the relative magnitude of their pure premiums). The second part of Table 10 summarises the premium calculation for Policyholders A, B and C, based on portfolio-level premium principles.

Table 10. Pricing calculation for selected policyholders in French Auto Insurance Dataset. For portfolio-level premium principles, the allocation is based on the relative size of pure premium, where the weight for policyholder i is $w_i = E[Y_i]/\sum_{j}E[Y_j]$. For each policyholder and premium principle, the calculation is done based on both prior probability (left column) and posterior probability (right column).

4.5. Parameter uncertainty

In addition to obtaining point estimates of the model parameters, it is crucial to also calculate their confidence intervals in order to identify significant parameters. Given that the log-likelihood in equation (4) becomes much more complicated when data are truncated and/or censored, analytically deriving the variance and confidence intervals of the parameter estimates would not be feasible and could be subject to numerical issues in implementation.

Instead, we can use the bootstrapping methods in Grün & Leisch (2004). A straightforward nonparametric bootstrapping algorithm is described as follows:

  1. Fit an LRMoE using the original dataset, and obtain an estimate $\hat{{\boldsymbol{\Phi}}}$.

  2. For a fixed number of total iterations, say 200, sample with replacement from the original dataset, and fit an LRMoE on each resample with the same component distributions.

  3. The estimates $\hat{{\boldsymbol{\Phi}}}_1, \hat{{\boldsymbol{\Phi}}}_2, \dots, \hat{{\boldsymbol{\Phi}}}_{200}$ from the bootstrapped samples will provide an estimate of the variance of the parameters.

The algorithm above can be easily implemented in parallel in julia as follows. We have included the estimated confidence intervals in the Appendix, calculated based on 200 bootstrapped samples.
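A minimal parallel sketch using julia's Distributed standard library; the fitting call fit_LRMoE and the starting values alpha_fit and comp_fit are the assumptions introduced earlier, not verbatim package code:

```julia
using Distributed
addprocs(4)                          # spawn worker processes
@everywhere using LRMoE, StatsBase   # load dependencies on every worker

n = size(X, 1)
boot_fits = pmap(1:200) do b
    idx = sample(1:n, n; replace=true)                   # resample rows with replacement
    fit_LRMoE(Y[idx, :], X[idx, :], alpha_fit, comp_fit) # refit on the bootstrap sample
end
# The spread of estimates across boot_fits estimates the parameter variances.
```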

4.6. Model simulation

In the LRMoE setting, the loss distributions of policyholders are mixtures of the same expert functions, but with potentially different mixing weights. Consequently, the distribution of the aggregate loss, as a sum of individual losses, usually does not admit a simple form.

Our package contains a dataset simulator, which helps with analysing the distribution of the aggregate loss. Given a portfolio of policyholders X and a model specification alpha and comp_dist, the simulator will return one set of random realisations of losses for each policyholder.

With multiple calls to sim_dataset, an approximation of the aggregate loss distribution can be obtained. This has been applied in section 4.4 to calculate premiums based on portfolio-level premium principles, as sketched below.
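A sketch of this portfolio simulation; the argument order of sim_dataset follows the description above but is an assumption:

```julia
using Statistics

# Simulate the whole portfolio 1,000 times and sum losses within each scenario.
agg_losses = [sum(sim_dataset(alpha_fit, X, comp_fit)) for _ in 1:1_000]

var99 = quantile(agg_losses, 0.99)              # 99% VaR of the aggregate loss
cte99 = mean(agg_losses[agg_losses .>= var99])  # 99% CTE of the aggregate loss
```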

4.7. Model visualisation

After fitting and choosing an appropriate LRMoE model, the user can visualise it with in-package plotting functions, or create more customised plots using generic simulation functions combined with plotting utilities or other dedicated packages such as Plots.jl and StatsPlots.jl. In this subsection, we will use the six-component llllll model for demonstration.

Latent Class Probabilities

The logit regression in the LRMoE assigns each policyholder to latent risk classes based on covariates. Given a fitted model and a vector of covariates, the latent class probabilities can be computed and visualised using the built-in predict_class_prior and plot_class_prob functions.

Figure 2. Prior and posterior latent class probabilities for selected policyholders. Component 6 (Lognormal(6.58, 1.95)) corresponds to the tail, while Component 3 (Lognormal(6.95, 1.05)) corresponds to the largest spike in the dataset.

Consider the policyholders in Table 9. The predict_class_prior function returns both the latent class probabilities (prob) and the most likely class (max_prob_idx). The following sample code calculates their latent class probabilities, where each row represents a policyholder and each column represents a latent class. The prior probabilities, based solely on covariate information, are plotted in Figure 2; there is not a large difference between these policyholders, and all most likely come from the first latent class.
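A sketch of this calculation; the return fields prob and max_prob_idx follow the description above, while the exact signatures are assumptions:

```julia
res = predict_class_prior(X_abc, alpha_fit)  # prior probabilities from covariates only
res.prob                                     # 3 × 6 matrix: policyholders × latent classes
res.max_prob_idx                             # most likely latent class per policyholder
plot_class_prob(res.prob)                    # built-in visualisation (hypothetical usage)
```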

For policyholders with known claim history, it may be more informative to consider the posterior latent class probabilities by calling the corresponding functions for posterior probabilities. The posterior probability of latent class j is given by $\texttt{P}\{ Z_{ij} = 1 | \hat{{\boldsymbol{\Phi}}}, {\textit{\textbf{X}}}, {\textit{\textbf{Y}}} \}$ for fitted parameters $\hat{{\boldsymbol{\Phi}}}$, which can be computed analogously to equation (6). Readers may also refer to section 6.3.2 of Fung et al. (2019a) for more details.

Consider again the same policyholders in Table 9. The code to calculate posterior probabilities is similar to the above. The posterior probabilities are also plotted in Figure 2. Given that Policyholders B and C have non-zero claims, a lot of latent class probability mass is shifted to Classes 3 and 6, which correspond to the middle and tail parts of the loss distribution (see also the largest spike in Figure 3). Meanwhile, Policyholder A has no claim, resulting in little change to the latent class probabilities, since all components have very similar zero inflation.

Figure 3. Overall goodness of fit of positive claims in the French auto insurance dataset.

The posterior probabilities are also helpful for adjusting premium rates. In Table 10, we contrast premium calculations based on prior and posterior probabilities. It is clear that Policyholder A is rewarded by a decrease in premium for having no claim history, while B and C see significant increases in their premium rates.

Overall Goodness of Fit

The overall goodness of fit can be examined by contrasting the empirical histogram against the fitted density curve, and by a Q–Q plot of empirical against fitted quantiles. These can be produced with the sim_dataset function included in our package, combined with basic plotting functions in julia. The corresponding plots are shown in Figure 3.

Covariate Influence

The partial dependence plot is commonly used in machine learning to investigate the influence of a particular covariate on the response, assuming independence amongst covariates (Friedman, 2001). For example, the marginal effect of a covariate on the mean claim amount can be obtained using the following steps (a sketch follows the list):

  1. Fix the covariate at a particular value (say, car brand = “F”) for all policyholders, while other covariates remain unchanged;

  2. Use the predict_mean_prior function to compute mean response values by policyholder (see section 4.4);

  3. Compute the grand mean of all values in step (2), which averages out the effect of other covariates and yields the mean response value when the car brand is of type “F”;

  4. Repeat steps (1)–(3) for a range of values of the same covariate of interest, for example, varying car brand from “F” to “VAS”.
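The steps above can be sketched as follows; the helper set_brand and the level list are hypothetical, since categorical covariates are typically dummy-encoded in the design matrix:

```julia
using Statistics

# Partial dependence of the mean claim amount on car brand (illustrative sketch).
brand_levels = ["F", "MCB", "OGF", "VAS"]  # hypothetical subset of brand levels
pd = map(brand_levels) do b
    Xb = set_brand(X, b)  # hypothetical helper: fix brand = b for all policyholders
    mean(predict_mean_prior(Xb, alpha_fit, comp_fit))  # grand mean of E[Y] at brand b
end
```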

The procedure above can be generalised to continuous covariates and other quantities of interest, such as the quantile of the response variable. The covariate influence plot graphically illustrates how the characteristics of a policyholder are associated with a particular measure of the response variable. For example, plotting the mean (or VaR/CTE) of the response variable against car brand reflects how a policyholder’s overall riskiness (or tail risk) is related to the car brand.

Figure 4 shows the influence of covariates Brand, Region and Car Age, which may be interpreted as follows. For car brand, Mercedes, Chrysler or BMW (MCB), Opel, General Motors or Ford (OGF) and Volkswagen, Audi, Skoda or Seat (VAS) are generally riskier. Also, compared with other regions, policies issued in regions IF (Ile-de-France) and L (Limousin) may be considered riskier. In addition, older cars are generally associated with lower risks.

5. Summary and Outlook

This paper presented a new julia package, LRMoE, for actuarial loss modelling. In this first version, the package covers the most basic needs for fitting an LRMoE model, as well as helping the user visualise the fitted model and calculate insurance premiums. There are several future developments in our plan, including:

  1. Model selection tools: As of the current version, model selection is done by AIC/BIC/CV, all of which require the user to choose and run a selection of models. Some automated model selection procedures may potentially be incorporated into the package (e.g. the SCAD-type penalty in Fan & Li, 2001 and Yin & Lin, 2016).

  2. Feature selection tools: Datasets usually contain a large number of covariates, but not all are important for predicting the response. While plots of covariate influence may offer some intuition of their relative importance, it may be more insightful to quantify their influence and provide a function to automatically choose the most influential covariates.

Furthermore, it is crucial to compare the fitting and predictive performance of our proposed LRMoE against classical regression models. Several existing papers have already shown that the Erlang Count LRMoE (EC-LRMoE) performs much better than the Negative Binomial GLM (see section 6.1 (Table 7) of Fung et al., 2019a), and that the Transformed Gamma LRMoE (TG-LRMoE) performs much better than various severity GLMs (see section 5.3.1 (Figure 2 and Table 2) of Fung et al., forthcoming). Nonetheless, it is still desirable to conduct extensive comparisons between the LRMoE under different expert functions and a wider range of classical models (including, e.g. GAMs), and this will be addressed in a new paper that we are currently working on.

Figure 4. Covariate influence: brand, region and car age.

Acknowledgements

The authors would like to thank the two anonymous referees for their valuable comments and suggestions. The authors acknowledge an academic research grant provided by the Canadian Institute of Actuaries. This work is also partly supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada. Spark C. Tseung and Tsz Chai Fung further acknowledge the financial support by the Edwin S.H. Leong Scholarship program and the SOA Hickman Scholars Program, respectively.

Appendix

The appendix contains the parameter estimates of model llllll presented in section 4, as well as their 95% confidence intervals. Significant parameters are marked in bold.

References

Blostein, M. & Miljkovic, T. (2019). On modeling left-truncated loss data using mixtures of distributions. Insurance: Mathematics and Economics, 85, 35–46.
Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–22.
Dutang, C. & Charpentier, A. (2019). CASdatasets: insurance datasets. R package version 1.0-10.
Fan, J. & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
Friedman, J.H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.
Fung, T.C., Badescu, A.L. & Lin, X.S. (2019a). A class of mixture of experts models for general insurance: application to correlated claim frequencies. ASTIN Bulletin, 49(3), 647–688.
Fung, T.C., Badescu, A.L. & Lin, X.S. (2019b). A class of mixture of experts models for general insurance: theoretical developments. Insurance: Mathematics and Economics, 89, 111–127.
Fung, T.C., Badescu, A.L. & Lin, X.S. (2020a). Fitting censored and truncated regression data using the mixture of experts models. Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3740061
Fung, T.C., Badescu, A.L. & Lin, X.S. (forthcoming). A new class of severity regression models with an application to IBNR prediction. North American Actuarial Journal. https://doi.org/10.1080/10920277.2020.1729813
Grün, B. & Leisch, F. (2004). Bootstrapping finite mixture models.
Grün, B. & Leisch, F. (2008). FlexMix version 2: finite mixtures with concomitant variables and varying and constant parameters. Journal of Statistical Software, 28(4), 1–35.
Gui, W., Huang, R. & Lin, X.S. (2018). Fitting the Erlang mixture model to data via a GEM-CMM algorithm. Journal of Computational and Applied Mathematics, 343, 189–205.
Jiang, W. & Tanner, M.A. (1999). On the identifiability of mixtures-of-experts. Neural Networks, 12(9), 1253–1258.
Jordan, M.I. & Jacobs, R.A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2), 181–214.
Lee, D., Li, W.K. & Wong, T.S.T. (2012). Modeling insurance claims via a mixture exponential model combined with peaks-over-threshold approach. Insurance: Mathematics and Economics, 51(3), 538–550.
Leisch, F. (2004). FlexMix: a general framework for finite mixture models and latent class regression in R. Journal of Statistical Software, 11(8), 1–18.
McLachlan, G. & Peel, D. (2004). Finite Mixture Models. New York: John Wiley & Sons.
Meng, X.L. & Rubin, D.B. (1993). Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika, 80(2), 267–278.
Miljkovic, T. & Grün, B. (2016). Modeling loss data using mixtures of distributions. Insurance: Mathematics and Economics, 70, 387–396.
Scollnik, D.P. & Sun, C. (2012). Modeling with Weibull-Pareto models. North American Actuarial Journal, 16(2), 260–272.
Tseung, S.C., Badescu, A.L., Fung, T.C. & Lin, X.S. (2020). LRMoE: an R package for flexible actuarial loss modelling using mixture of experts regression model. Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3740215
Yin, C. & Lin, X.S. (2016). Efficient estimation of Erlang mixtures using iSCAD penalty with insurance application. ASTIN Bulletin, 46(3), 779–799.