1. Introduction
The ongoing improvements in human life expectancy over the past few decades have introduced outstanding challenges in predicting mortality scenarios. As plotted in Figure 1(a), for the Italian population, consistent mortality improvements are observed for ages 0–100 over the investigated period 1950–2016. This issue is concerning, since accurate forecasts are essential to the preparation of plans by governments, the design of pension schemes and annuity products, and the reserving by insurance companies. Specifically, longevity risk, the risk that mortality rates are underestimated, can impose higher-than-expected costs on life annuity providers and defined-benefit pension funds. This risk may therefore negatively influence the global longevity risk market for pension liabilities, the size of which is around 60–80 trillion USD as of 2018 (Blake et al., 2019).
To understand and reduce the risks related to mortality and longevity, mortality modeling and forecasting have become standard mitigation tools. Among existing methods, one popular stream is based on the seminal work of Lee and Carter (1992), widely known as the Lee–Carter (LC) model. Influential extensions of LC have been proposed in the actuarial literature. Popular models include the Renshaw–Haberman (RH) model (Renshaw and Haberman, 2006), the functional demographic (HU) model proposed in Hyndman and Ullah (2007), and the age-period-cohort (APC) model studied in Cairns et al. (2009). Despite their popularity, a major drawback of LC and its extensions is the lack of age coherence in forecasting. In other words, even for adjacent ages, forecasts produced by the LC model may diverge without bound in the long run and are thus not biologically reasonable.
Existing literature has proposed a range of parametric specifications to resolve this issue. An influential work by Li and Lu (2017) develops the spatial–temporal autoregressive (STAR) model, whose established cointegration feature delivers the desirable age coherence. Another novel attempt is the age-coherent extensions of the LC model using hyperbolic (LC-H) and geometric (LC-G) convergences in the relative speeds of age-specific mortality decline (Gao and Shi, 2021). A similar principle is adopted in the sparse VAR (SVAR) model by Li and Shi (2021), and we name the SVAR models using hyperbolic and geometric convergences in the long-term mean SVAR-H and SVAR-G, respectively. Despite the realized age coherence, drawbacks still inevitably exist for those models. For instance, the STAR model employs an ad hoc and thus restrictive sparsity structure to resolve the inherent high-dimensionality ( $p \gg N$ ) issue of mortality data (Guibert et al., 2019). As in LC, LC-H and LC-G employ an inflexible temporal parametric approach (Li and Lu, 2017). SVAR-H and SVAR-G work with differenced log mortality rates (i.e., mortality improvements), which may lose information. Given the diversity, scarcity, and long-spanning nature of mortality data, any individual model could be misspecified. Consequently, it is not realistic to expect that a single model would provide optimal forecasts in all cases (Bates and Granger, 1969; Kleijn and Van Dijk, 2006; Genre et al., 2013).
To overcome this outstanding issue, this paper employs ensemble averaging, also known as the model averaging (MA) approach, to combine coherent mortality models. MA is a sophisticated and well-developed approach in machine learning (Ley and Steel, 2009; Amini and Parmeter, 2012; Lessmann et al., 2012; Bork et al., 2020; Bravo et al., 2021), which has been widely applied in recent economics and finance research and practice (see, for example, Eicher et al., 2011; Mirestean and Tsangarides, 2016; Shiraya and Takahashi, 2019; Baechle et al., 2020; du Jardin, 2021, among others). With respect to mortality data, Shang (2012) and Kontis et al. (2017) have employed various MA strategies, such as Bayesian model averaging (BMA). A recent study by Kessy et al. (2021) examines stacked regression ensembles.
Among those studies, the same weight is assigned to all ages within an individual model to ease the computation and improve the stability in forecasting. However, mortality models usually behave differently depending on the age group. To see this, we model the Italian data over 1950–1992 and forecast rates spanning 1993–2006. The out-of-sample root mean squared forecast error (RMSFE) at each age is then produced by averaging errors over all forecasting steps. The results of LC and its extensions are plotted in Figure 1(b). Clearly, the LC and APC models are less favorable over young age groups and should be assigned lower weights. Consequently, age-specific weights are more desirable to achieve optimal forecasting performance.
In this paper, we employ the classic global minimum variance portfolio (GMVP) strategy (Markowitz, 1952) to choose the weights of the proposed MA approach. The implementation of GMVP without restriction, however, can be dangerous in mortality forecasting, due to inherent issues of the GMVP solution. First, GMVP ignores parameter error, and using the in-sample variance may lead to overfitting when forecasting mortality rates. Second, if a dominant model exists, the GMVP solution may lead to a “winner-take-all” issue, such that most weight is attributed to this dominant model.
To combat those issues, we employ two strategies. First, following the approach of Shi (2022a), the variance is minimized for out-of-sample forecast errors, rather than for in-sample residuals. It is well known that out-of-sample error, such as the MSFE, consists of both process and parameter risks. Hence, the first issue of GMVP is well addressed via this strategy. Second, to realize the desirable diversity, we employ ten models in the ensemble space: the STAR, LC-H, LC-G, SVAR-H, SVAR-G, LC, SVAR, APC, HU, and RH models. More importantly, diversified age-specific weights are permitted to allow for the dynamics evidenced in Figure 1(b).
Further, two additional penalties are considered in our MA approach to incorporate strong and reasonable prior information of the mortality data. Specifically, as described above, long-run mortality forecasts should be age coherent. To achieve this, a coherence penalty is imposed to reduce the weights of age-incoherent models. Moreover, due to the heterogeneity in out-of-sample forecasts among models, drastic changes of weights from one age to another may result in abrupt differences between forecast mortality rates. For instance, the forecast rate of age 50 in some year might be higher than that of age 51. In life insurance practice, this would unreasonably imply that younger policyholders pay higher premiums than older policyholders. To resolve this issue, we impose a smoothness penalty in the optimization to reduce abrupt changes in the weights of adjacent ages in the same model. Both penalties are then selected using a hold-out-sample strategy. In addition, a non-negativity constraint is applied to improve the interpretability of weights.
Altogether, we consider eleven models in this paper: STAR, LC-H, LC-G, SVAR-H, SVAR-G, LC, SVAR, APC, HU, RH, and MA. Empirical data for ten European countries, sourced from the Human Mortality Database (2019), are examined. The ages are one-year groups of 0–100, and the timespan is 1950–2016. We first present the baseline results of ten-step-ahead forecasts, using the popular performance measure RMSFE over both ages and years. Three sets of sensitivity analyses are also conducted, by individually altering the forecasting step, temporal coverage, and age groups. A comparison of the proposed MA approach with the influential model confidence set (MCS) proposed in Hansen et al. (2011) is further conducted. To systematically demonstrate its usefulness in economic and financial practice, the proposed MA approach is employed to illustrate fixed-term annuity pricing in a separate case study.
The contributions of this paper are fivefold. First, this study is among the first to comprehensively consider both age-coherent and non-age-coherent specifications and adopt age-specific weights in mortality forecasting using the ensemble averaging strategy. The adopted objective function effectively achieves the benefit of the GMVP solution. The diversity and overfitting issues of GMVP are simultaneously addressed, by working with out-of-sample forecast errors and imposing two penalties to honor the reliable prior information of mortality data. The estimated weights are therefore optimal in balancing the randomness of small sample sizes and the consistency of estimation, while adopting useful prior information. Essentially, our MA method is a supervised machine learning technique. Compared to other techniques, such as neural networks, the MA approach is more transparent and does not suffer from the “black-box” issue. Second, the proposed MA method provides asymptotically age-coherent forecasts. This is both theoretically verified and empirically visualized in the long-term forecast of our case study. Thus, the desirable age coherence feature is not lost when MA is executed, even when age-incoherent models are included in the ensemble. Third, our empirical results demonstrate that the proposed MA method can significantly improve the out-of-sample forecasting performance of all individual models in the ensemble. According to the RMSFE, MA ranks first for eight out of ten populations. This result is robust across all sets of sensitivity analyses. Fourth, we consider other model selection/averaging approaches, such as MCS and BMA. The outperformance of our proposed approach is demonstrated, and the validity of the proposed estimation approach is illustrated by presenting the estimated age-specific weights. Fifth, a case study is provided to demonstrate the effectiveness of the MA method in economic and financial practice.
The narrower width of prediction intervals suggests that MA can be more efficient in measuring uncertainties in mortality forecasting. This is critical to practices like annuity pricing, and an illustration is thoroughly conducted over different starting ages and maturity terms. Thus, the proposed MA approach can be a widely useful tool in forecasting mortality data for actuarial products. Implications for other actuarial modeling practices are also briefly outlined.
The rest of this paper is organized as follows. In Section 2, we review the specification and features of the LC model and briefly outline its extensions. Five coherent alternative models are explained in Section 3. The MA approach is proposed in Section 4. We conduct empirical studies with robustness checks and additional analysis in Section 5. Section 6 presents a case study, and Section 7 concludes.
2. The Lee–Carter model
Among the existing models, the seminal work proposed in Lee and Carter (1992), known as the Lee–Carter (LC) model, is one of the most popular approaches. Essentially, LC belongs to the family of factor models and decomposes mortality rates into age-dependent and time-series factors. Specifically, the logged central mortality rate of age x in year t, or $\ln m_{x,t}$ , is specified as follows:

$\ln m_{x,t}=a_{x}+b_{x}k_{t}+\varepsilon_{x,t}, \qquad (2.1)$
where $x \in (1,\ldots,N)$ and $t \in (1,\ldots,T)$ . In terms of the factors, $a_x$ is the average age trend of $\ln m_{x,t}$ across t, $b_x$ is the relative speed of decline in $\ln m_{x,t}$ at each x, and $k_t$ is a time-dependent index at each t. Additionally, $\varepsilon_{x,t}$ is the residual at age x and time t. It is usually assumed that $\varepsilon_{x,t}$ follows a multivariate Gaussian distribution with zero mean and independence over the temporal dimension.
Regarding the estimation of LC, the singular value decomposition (SVD) is the most frequently employed technique. The detailed procedure to implement the SVD is described in Trefethen and Bau (1997). It is worth noting that the solutions for $\hat{b}_x$ and $\hat{k}_t$ via the SVD are not unique. Necessary constraints are therefore imposed to obtain unique estimates, such that $\sum_{x=1}^N\hat{b}_x=1$ and $\sum_{t=1}^T\hat{k}_t=0$ .
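As an illustrative sketch (not the original authors' code), the SVD-based estimation with the two identification constraints can be written as follows; the synthetic mortality surface and all variable names are assumptions made for this example:

```python
import numpy as np

# Hypothetical synthetic log mortality surface: N ages x T years.
rng = np.random.default_rng(0)
N, T = 5, 40
a_true = np.linspace(-6.0, -2.0, N)               # age pattern
b_true = np.array([0.35, 0.25, 0.2, 0.12, 0.08])  # sums to 1
k_true = -0.5 * np.arange(T) + rng.normal(0, 0.1, T)
k_true -= k_true.mean()                           # identification: sum k = 0
log_m = a_true[:, None] + np.outer(b_true, k_true) + rng.normal(0, 0.01, (N, T))

# Step 1: a_x is the row (age) average of ln m_{x,t}.
a_hat = log_m.mean(axis=1)

# Step 2: SVD of the centred matrix; the leading pair gives b_x and k_t.
U, s, Vt = np.linalg.svd(log_m - a_hat[:, None], full_matrices=False)
b_hat, k_hat = U[:, 0], s[0] * Vt[0, :]

# Step 3: impose the identification constraints sum(b) = 1 and sum(k) = 0;
# the product b_x * k_t is unchanged by this rescaling.
k_hat = (k_hat - k_hat.mean()) * b_hat.sum()
b_hat = b_hat / b_hat.sum()
```

Because the rescaling multiplies $\hat{k}_t$ by the same factor it divides out of $\hat{b}_x$, the fitted surface $\hat{a}_x+\hat{b}_x\hat{k}_t$ is unaffected by the choice of normalization.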
Remark 1. Much of the popularity of LC estimated via SVD stems from its self-explanatory parameters. Intuitively, $a_x$ describes the overall, or “average,” pattern of the historical logged mortality rates on the age dimension. $b_x$ and $k_t$ are derived from orthogonal singular vectors and represent the relative dynamics of $\ln m_{x,t}$ on the age and temporal dimensions, respectively.
To forecast the future logged mortality rates, or $\ln \hat{m}_{x,T+h}$ , $\hat{a}_x$ and $\hat{b}_x$ are kept constant in Lee and Carter (1992). The temporal factor $\hat{k}_t$ , however, is modeled as a random walk with drift, such that

$\hat{k}_{t}=\hat{k}_{t-1}+d_{k}+e_{t}, \qquad (2.2)$
where $d_k$ is the average change in $\hat{k}_t$ , and $e_t$ are independently and identically distributed (iid) Gaussian sequences with zero means. Using (2.2) and the property of a random walk process, the h-step-ahead forecast $\hat{k}_{T+h}$ is produced as $\hat{k}_{T}+hd_k$ $\forall h\ge1$ . Thus, the h-step-ahead forecast of the logged mortality rate can be obtained as follows:

$\ln \hat{m}_{x,T+h}=\hat{a}_{x}+\hat{b}_{x}\hat{k}_{T+h}=\hat{a}_{x}+\hat{b}_{x}\left(\hat{k}_{T}+hd_{k}\right). \qquad (2.3)$
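A minimal sketch of the drift estimate and the h-step-ahead forecast follows; the period index values are hypothetical. Note that the average one-step change telescopes, so $d_k=(\hat{k}_T-\hat{k}_1)/(T-1)$:

```python
import numpy as np

# Hypothetical estimated period index from an LC fit.
k_hat = np.array([10.0, 8.4, 7.1, 5.2, 3.9, 2.1, 0.8, -1.5])
T = len(k_hat)

# Drift of the random walk: the average one-step change, which
# telescopes to (k_T - k_1) / (T - 1).
d_k = np.diff(k_hat).mean()

# h-step-ahead point forecast: k_{T+h} = k_T + h * d_k.
def forecast_k(h):
    return k_hat[-1] + h * d_k

# Plug into the LC forecast: ln m_{x,T+h} = a_x + b_x * k_{T+h}.
a_x, b_x = -4.0, 0.2
log_m_fc = a_x + b_x * forecast_k(10)
```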
Based on the LC model, many extensions using the factor framework have been proposed. For instance, Renshaw and Haberman (2006) include the cohort effect in the LC specification, which considers stochastic features of people born in different time periods (denoted as the RH model). An age-period-cohort (APC) model is derived from RH by constraining the age-specific loadings of temporal and cohort effects (Cairns et al., 2009). Hyndman and Ullah (2007) extend LC by introducing more principal components in the decomposition of age and temporal effects, which are fitted via a functional modeling approach (denoted as the HU model).
3. Coherent mortality models
One of the major drawbacks of the LC model is the lack of age coherence in forecasts (see, for instance, Gao and Shi, 2021, among others). The same issue also applies to its popular extensions, including the RH, APC, and HU models. The concept of age coherence is first proposed in the influential work of Li and Lu (2017), and it ensures biologically reasonable forecasts of mortality rates in the long run. Intuitively, age coherence means that predicted mortality rates (of the same population) across ages should not differ indefinitely. Consistent with Li and Lu (2017), Gao and Shi (2021), and Li and Shi (2021), we formally define age coherence as follows.
Definition 1. Age coherence means that for the h-step-ahead forecasts, $|\ln \hat{m}_{i,T+h}-\ln \hat{m}_{j,T+h}|=O_p(1)$ , $\forall i,j \in (1,\ldots,N)$ . For the concept of $O_p(\cdot)$ , given a set of random variables $X_n$ and a corresponding set of constants $a_n$ , if $X_n=O_p(a_n)$ , then for any $\varepsilon > 0$ , there exists a finite $M > 0$ and a finite $N > 0$ such that $P(|X_n/a_n|\ge M)\le \varepsilon$ $\forall n>N$ . That is, when $h \rightarrow \infty$ , $|\ln \hat{m}_{i,T+h}-\ln \hat{m}_{j,T+h}|$ will not diverge to infinity.
Remark 2. Note that age coherence can also be defined using the original mortality scale, such that for the h-step-ahead forecasts, we have $|\hat{m}_{i,T+h}/\hat{m}_{j,T+h}|=O_p(1)$ , $\forall i,j \in (1,\ldots,N)$ . To cope with the original scale, a Poisson distribution (assumed for death counts) associated with a log-linear regression is often employed, and the risk exposure (population counts) is usually employed as the offset factor (Brouhns et al., 2002).
To address the lack of age coherence in the LC model, we describe five recently proposed alternatives in this section. Among them, two are based on the LC framework, whereas three are developed from the vector-autoregressive (VAR) model.
3.1. The spatial–temporal autoregressive (STAR) model
Despite its popularity, a general VAR model cannot be directly fitted to a typical mortality data set for two reasons. First, a VAR model requires a stationary response variable, while mortality rates over time are clearly trending and therefore nonstationary. Second, the number of unknown parameters (p) exceeds the number of observations (T) in an unconstrained VAR framework. For instance, with an intermediate number of age groups (N), say 50, the $p = 50\times51 \gg T$ issue will arise, given that observations are available only for a few dozen years.
To resolve those issues, Li and Lu (2017) propose the spatial–temporal autoregressive (STAR) model to study and forecast mortality data. On the temporal dimension, it considers Granger causality and cointegration to resolve the stationarity problem. On the age dimension, the STAR model adopts sparse spatial information to reduce the dimensionality of p with the following specification:

$y_{i,t}=\alpha_{i}+\beta_{i,i-2}\,y_{i-2,t-1}+\beta_{i,i-1}\,y_{i-1,t-1}+\left(1-\beta_{i,i-2}-\beta_{i,i-1}\right)y_{i,t-1}+\epsilon_{i,t}, \qquad (3.1)$
where we let $y_{i,t}=\ln m_{i,t}$ for simplicity, $i=3,4,\ldots,N$ , and $t=1,2,\ldots,T$ . The residual sequence $\epsilon_{i,t}$ is similarly assumed as in the LC model.
Rewriting (3.1) in a VAR(1) form, we have that

$\boldsymbol{{y}}_{t}=\boldsymbol{\alpha}+\boldsymbol{{B}}\boldsymbol{{y}}_{t-1}+\boldsymbol{\epsilon}_{t}, \qquad (3.2)$
where $\boldsymbol{{y}}_t$ = $(y_{1,t},y_{2,t},\ldots,y_{N,t})^\prime$ , $\boldsymbol{\alpha}$ = $(\alpha_{1},\alpha_{2},\ldots,\alpha_{N})^\prime$ , $\boldsymbol{\epsilon}_t$ = $(\epsilon_{1,t},\epsilon_{2,t},\ldots,\epsilon_{N,t})^\prime$ , and

$\boldsymbol{{B}}=\begin{pmatrix} 1 & & & & \\ \beta_{2,1} & 1-\beta_{2,1} & & & \\ \beta_{3,1} & \beta_{3,2} & 1-\beta_{3,1}-\beta_{3,2} & & \\ & \ddots & \ddots & \ddots & \\ & & \beta_{N,N-2} & \beta_{N,N-1} & 1-\beta_{N,N-2}-\beta_{N,N-1} \end{pmatrix}, \qquad (3.3)$

where blank entries are zero.
To ensure invertibility and interpretability, it is required that $0<\beta_{i,i-2}<1$ , $0<\beta_{i,i-1}<1$ , and $0<\beta_{i,i-2}+\beta_{i,i-1}<1$ for all $i>2$ . When $i=2$ , we need that $0<\beta_{2,1}<1$ .
The interpretation of the STAR model parameters is straightforward. For age $i>2$ , $(1-\beta_{i,i-2}-\beta_{i,i-1})$ is the usual temporal effect of the first time lag $y_{i,t-1}$ on $y_{i,t}$ . As for the mortality practice, cohort effects are more important, and they are measured by $\beta_{i,i-1}$ (the same cohort) and $\beta_{i,i-2}$ (the younger neighboring cohort). The relevant positivity constraints ensure not only the invertibility (and thus the stationarity of $y_{i+1,t}-y_{i,t}$ ) but also the interpretability of the temporal and cohort effects. A negative estimate is usually unexplainable, nor is one reported in the existing literature.
In addition to the interpretability, the specification described in (3.2) and (3.3) has attractive statistical features. As shown in Li and Lu (2017), under the assumption that all $y_{i,t}$ are I(1) (commonly used in mortality research), all neighboring age pairs $y_{i,t}$ and $y_{i+1,t}$ are cointegrated with cointegrating vector $(\!-\!1,1)$ . This approach successfully resolves the stationarity issue and results in age coherence. More specifically, cointegration suggests that $y_{i+1,t}-y_{i,t}$ is stationary, which leads to a constant long-run mean. Further, the total number of parameters is largely reduced from $N^2+N$ to $3N-3$ , which is feasible to solve under a general penalized least squares framework with closed-form solutions.
Remark 3. To reduce the randomness of estimates owing to limited data availability, Li and Lu (2017) conduct the estimation in a penalized least squares fashion, which imposes a Tikhonov ( $L_2$ ) regularization in the loss function. In other words, three smoothing penalties (corresponding to $\alpha_i$ , $\beta_{i,i-1}$ and $\beta_{i,i-2}$ , respectively, for age i) need to be preselected before a solution can be derived. Essentially, this assumes that the coefficients across ages should change smoothly. See Section 3.4 for how those penalties are selected.
The forecasting of STAR is therefore performed in an iterative fashion, where

$\hat{\boldsymbol{{y}}}_{T+1}=\hat{\boldsymbol{\alpha}}+\hat{\boldsymbol{{B}}}\boldsymbol{{y}}_{T}, \qquad \hat{\boldsymbol{{y}}}_{T+h}=\hat{\boldsymbol{\alpha}}+\hat{\boldsymbol{{B}}}\hat{\boldsymbol{{y}}}_{T+h-1}, \qquad (3.4)$
and $h>1$ . Due to the established co-integration feature, $\hat{y}_{i,t+h}-\hat{y}_{j,t+h}$ is stationary for all $i,j=1,\ldots,N$ and $i\ne j$ , and $h\ge 1$ . This then straightforwardly ensures the age coherence in forecasting.
3.2. Age-coherent extensions of the LC model
Other than working with the VAR framework, a recent study by Gao and Shi (2021) proposes two effective age-coherent extensions of the LC model. Both are motivated by Li et al. (2013) and rotate $\hat{b}_{x,h}$ gradually over time. The rotation is supported by the fact that mortality decline decelerates at younger ages and accelerates at old ages (Li et al., 2013). Also, to realize age coherence in the long run, Gao and Shi (2021) require that $\hat{b}_{x,h}$ eventually converge (rotate) to a constant $1/N$ for all ages (a flat line). The rationale of $1/N$ comes from the constraint that $\sum_{x=1}^N\hat{b}_{x,h}=1$ when $h \ge 1$ and $\hat{b}_{i,h}=\hat{b}_{j,h}$ when $h \rightarrow \infty$ .
Specifically, the first extension adopts the autoregressive (AR) framework. Under the AR(1) scenario, the time-varying $\hat{b}_{x,h}$ (represented by $\hat{b}^G_{x,h}$ ) is specified below:

$\hat{b}^{G}_{x,h}=r^{l}_{x}\,\hat{b}^{G}_{x,h-1}+\left(1-r^{l}_{x}\right)\frac{1}{N}, \qquad (3.5)$
where $h\ge 1$ and $\hat{b}^G_{x,0}=\hat{b}_{x}$ . To ensure stationarity, the coefficient $r^l_x$ must strictly fall between 0 and 1. In such a case, $\hat{b}^G_{x,h}$ converges to the long-run mean $1/N$ geometrically. For a larger (smaller) $r^l_x$ , the speed of convergence is slower (faster).
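The geometric rotation can be sketched as below (illustrative loadings; a scalar $r$ is used for brevity, whereas the text lets it vary by age). The recursion preserves $\sum_x \hat{b}^G_{x,h}=1$ at every horizon and pulls all loadings toward the coherent limit $1/N$:

```python
import numpy as np

N = 101
# Illustrative fitted LC loadings: decreasing in age, summing to 1.
b0 = np.linspace(2.0, 0.5, N)
b0 /= b0.sum()

def rotate_geometric(b, r, h):
    # b^G_{x,h} = r * b^G_{x,h-1} + (1 - r) / N, with b^G_{x,0} = b_x.
    for _ in range(h):
        b = r * b + (1.0 - r) / N
    return b

b_slow = rotate_geometric(b0, 0.99, 50)    # larger r: slower convergence
b_fast = rotate_geometric(b0, 0.50, 50)    # smaller r: faster convergence
b_long = rotate_geometric(b0, 0.99, 5000)  # long horizon: coherent limit 1/N
```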
The second extension employs hyperbolic decay, which may achieve the final convergence more slowly than the AR specification. This concept is often referred to as long memory in time-series analysis for economic and financial practices (see, for example, Smallwood and Norrbin, 2006; Ho and Shi, 2020, among others). When assumed to decay hyperbolically, the time-varying $\hat{b}_{x,h}$ (represented by $\hat{b}^H_{x,h}$ ) is described by

$\hat{b}^{H}_{x,h}=\delta_{h}(d^{l}_{x})\,\hat{b}_{x}+\left(1-\delta_{h}(d^{l}_{x})\right)\frac{1}{N}, \qquad (3.6)$
where $\delta_{h}(d^{l}_{x})$ denotes the hyperbolic decay coefficient at horizon h, which satisfies $\delta_{0}(d^{l}_{x})=1$ and decays toward 0 at a hyperbolic (rather than geometric) rate as h grows.
The hyperbolic parameter, or $d^l_x$ , needs to fall between 0 and 1 for a stationary rotation. In this case, since $\delta_{h} (d^l_x) \rightarrow 0$ when $h \rightarrow \infty$ , $\hat{b}^H_{x,h}$ will eventually converge to $1/N$ and thus lead to age-coherent forecasts. The speed of decay is slower (faster) for larger (smaller) $d^l_x$ .
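The exact form of $\delta_h(\cdot)$ is given in Gao and Shi (2021); the sketch below assumes the product-form long-memory coefficients $\delta_h(d)=\prod_{j=1}^{h}(j-1+d)/j$ (an assumption of this illustration), which exhibit the two properties used in the text: $\delta_h \rightarrow 0$ as $h \rightarrow \infty$, with slower decay for larger $d$:

```python
import numpy as np

def delta(h, d):
    # Assumed product-form hyperbolic coefficients (delta_0 = 1):
    # delta_h(d) = prod_{j=1..h} (j - 1 + d) / j, decaying like h^(d-1).
    j = np.arange(1, h + 1)
    return float(np.prod((j - 1.0 + d) / j))

def b_hyperbolic(b_x, h, d, N):
    # b^H_{x,h} interpolates between the fitted b_x and the limit 1/N.
    w = delta(h, d)
    return w * b_x + (1.0 - w) / N
```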
Note that both $r^l_x$ and $d^l_x$ are age dependent, and their estimation is not a trivial issue due to the large sample size on the age dimension. Gao and Shi (2021) employ the inversed Epanechnikov kernel to reduce the complexity of estimation. More specifically, let $\tau$ be defined as a scaled index $x/N$ with $x \in \{1,\ldots,N\}$ ; when evaluated at N, the inversed Epanechnikov kernel is determined by $1- K_{b^w}(\tau-1)=1-K(\frac{\tau-1}{{b^w}})$ , where $K(\tau-1)=0.75(1-(\tau-1)^2)_+$ . The parameter ${b^w}$ is the corresponding kernel bandwidth and is constrained to fall in the range (0, 1]. Hence, the parametric structures of $r^l_x$ and $d^l_x$ are simplified to follow the pattern of $1-K_{b^w}(\tau-1)$ across ages.
Remark 4. The rationale of employing the inversed Epanechnikov kernel is thoroughly discussed in Gao and Shi (2021). In short, the bandwidth determines a “cutoff” age, and the kernel describes a pattern consistent with the sense that mortality decline is more difficult for older ages (Li et al., 2013), starting from this cutoff. For instance, with ages 0–100 and a bandwidth of 0.7, all ages younger than 70 will share the same mortality decline speed. For ages older than 70, this decline is more difficult for an age further away from 70. Essentially, only two parameters are needed when (3.5) or (3.6) is employed: the decay parameter of the first age ( $r^l_1$ or $d^l_1$ ) and the kernel bandwidth ${b^w}$ . See Section 3.4 for how those parameters are selected.
Once $r^l_x$ and $d^l_x$ become available, the forecast logged mortality rates of the LC extension with $\hat{b}^G_{x,h}$ (LC-G) and that with $\hat{b}^H_{x,h}$ (LC-H) are, respectively,

$\ln \hat{m}_{x,T+h}=\hat{a}_{x}+\hat{b}^{G}_{x,h}\,\hat{k}_{T+h} \quad \text{and} \quad \ln \hat{m}_{x,T+h}=\hat{a}_{x}+\hat{b}^{H}_{x,h}\,\hat{k}_{T+h}, \qquad (3.7)$
where $\hat{a}_x$ and $\hat{k}_t$ are identical to those of the original LC model. In other words, the in-sample fits of LC-G and LC-H are the same as those described in (2.1). Recall that when $h \rightarrow \infty$ , $\hat{b}^G_{x,h}$ or $\hat{b}^H_{x,h}$ converges to $1/N$ . Thus, following (3.7), using either LC-G or LC-H, forecast rates of different ages will not differ infinitely and are thus age coherent.
3.3. The sparse VAR (SVAR) model and coherent extensions
Despite its attractive features, STAR's parametric structure as in (3.3) is ad hoc and relatively inflexible. Specifically, the effects of all cohorts other than the same and the next younger one are forced to be zero. This may limit a comprehensive analysis of the effects of all possible cohorts.
To address this, Guibert et al. (2019) work on the mortality improvements denoted by $\triangle y_{i,t}=y_{i,t}-y_{i,t-1}$ and propose a sparse VAR (SVAR) model. Since $y_{i,t}$ is assumed I(1), the differencing resolves the stationarity issue of a VAR system. Moreover, the SVAR model adopts a purely data-driven method to select nonzero coefficients via elastic-net (ENET) penalized estimation, rather than working with the ad hoc structure in (3.3). The SVAR model is specified below:

$\triangle{\boldsymbol{{y}}}_{t}=\boldsymbol{\alpha}+\sum_{v=1}^{V}\boldsymbol{{B}}_{v}\,\triangle{\boldsymbol{{y}}}_{t-v}+\boldsymbol{\epsilon}_{t}, \qquad (3.8)$
where $\triangle{\boldsymbol{{y}}}_t$ = $(\triangle y_{1,t},\triangle y_{2,t},\ldots,\triangle y_{N,t})^\prime$ , V is the preselected number of AR lags, and the sparsity of $\boldsymbol{{B}}_v$ is determined by the ENET penalty without any constraints. The forecast is performed similarly to (3.4) as a usual VAR-type model for $\triangle{\boldsymbol{{y}}}_t$ , and the forecast of the original logged mortality rate is computed as $\widehat{\boldsymbol{{y}}}_{t+h} = \boldsymbol{{y}}_{t} + \sum_{l=1}^{h} \triangle{\widehat{\boldsymbol{{y}}}}_{t+l}$ .
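Mapping the forecast improvements back to level forecasts is a cumulative sum, sketched below with hypothetical numbers for three ages and three steps ahead:

```python
import numpy as np

# Last observed log rates (N = 3 ages) and hypothetical forecast
# improvements for h = 1, 2, 3 steps ahead, shape (N, h).
y_T = np.array([-7.0, -5.5, -4.0])
dy_fc = np.array([[-0.020, -0.019, -0.018],
                  [-0.015, -0.015, -0.014],
                  [-0.010, -0.010, -0.010]])

# y_{T+h} = y_T + cumulative sum of forecast improvements up to step h.
y_fc = y_T[:, None] + np.cumsum(dy_fc, axis=1)
```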
However, the desirable feature of age coherence is lost when working with mortality improvements directly, as in the SVAR model. That is, as long as the estimated long-term mean of $\triangle y_{i+1,t}-\triangle y_{i,t}$ (denoted by $\triangle\hat{\mu}_{i+1}$ ) is not $O_p(1/h)$ , the long-run forecast ${\widehat{y}_{i+1,t+h}}-{\widehat{y}_{i,t+h}}={y_{i+1,t}}-{y_{i,t}}+h\triangle\hat{\mu}_{i+1}$ will still diverge as $h \rightarrow \infty$ .
To produce age-coherent forecasts, a recent study of Li and Shi (2021) extends the approach of Gao and Shi (2021) to the SVAR framework. In particular, the intercept $\hat{\alpha}_{x}$ is allowed to be time-varying for age x, and Li and Shi (2021) adopt the hyperbolic decay as follows:

$\hat{\alpha}^{H}_{x,T+h}=\delta_{h}(d^{s}_{x})\,\hat{\alpha}_{x}+\left(1-\delta_{h}(d^{s}_{x})\right)\hat{\alpha}^{*}_{x},$
where $\hat{\alpha}_x$ is the estimate of a usual SVAR model as of (3.8), and $\hat{\alpha}^*_x$ is the long-term mean of $\hat{\alpha}_x$ . This specification is named SVAR with hyperbolic decay, or the SVAR-H model. A similar specification can then be derived following the geometric decay as in (3.5):

$\hat{\alpha}^{G}_{x,T+h}=r^{s}_{x}\,\hat{\alpha}^{G}_{x,T+h-1}+\left(1-r^{s}_{x}\right)\hat{\alpha}^{*}_{x}, \quad \text{with } \hat{\alpha}^{G}_{x,T}=\hat{\alpha}_{x},$
which is named SVAR with geometric decay, or SVAR-G model. The age-dependent $r^s_x$ and $d^s_x$ can be specified using the inversed Epanechnikov kernel, as for LC-G and LC-H models.
Forecasts of $\triangle{\widehat{\boldsymbol{{y}}}}_{T+h}$ using the SVAR-G or SVAR-H model are the same as in (3.4). Logged mortality rates can then be forecast by

$\widehat{\boldsymbol{{y}}}_{T+h}=\boldsymbol{{y}}_{T}+\sum_{l=1}^{h}\triangle{\widehat{\boldsymbol{{y}}}}_{T+l}.$
Remark 5. Since $\triangle y_{x,t}$ is stationary, $\hat{y}_{x,T+h}$ will approach $y_{x,T}+ h \hat{\mu}_{x}$ in the long run, where $\hat{\mu}_{x}$ is the estimated mean of $\triangle y_{x,t}$ . In a matrix form, with $\boldsymbol{\mu}=(\hat{\mu}_{1},\ldots,\hat{\mu}_{N})^\prime$ , it is easy to see that

$\boldsymbol{\mu}=\Big(\boldsymbol{{I}}-\sum_{v=1}^{V}\boldsymbol{{B}}_{v}\Big)^{-1}\boldsymbol{\alpha}.$
Further, recall that when $h \rightarrow \infty$ , $\hat{\alpha}^G_{x,T+h}$ or $\hat{\alpha}^H_{x,T+h}$ converges to $\hat{\alpha}^*_x$ . To achieve age coherence, $\hat{\alpha}^*_x$ is obtained such that all elements of $\boldsymbol{\mu}$ are identical. In practice, estimates of $\hat{\alpha}^*_x$ are produced as follows:

$\hat{\boldsymbol{\alpha}}^{*}=\Big(\boldsymbol{{I}}-\sum_{v=1}^{V}\hat{\boldsymbol{{B}}}_{v}\Big)\hat{\boldsymbol{\mu}}^{*},$
where $\hat{\boldsymbol{{B}}}_v$ are in-sample estimates of a usual SVAR model, and $\hat{\boldsymbol\mu}^*$ consists of identical elements of $\hat{\mu}^*$ , which can be estimated as the average of $\hat{\mu}_x$ over all ages. $\hat{\mu}_x$ is obtained from the in-sample estimates of a usual SVAR model with (3.9).
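The algebra of Remark 5 can be sketched numerically. The coefficients below are made-up stable values for an illustrative $V=1$ system, and the stationary-mean relation $\boldsymbol{\mu}=(\boldsymbol{I}-\sum_v\boldsymbol{B}_v)^{-1}\boldsymbol{\alpha}$ is the standard one for a stationary VAR (an assumption of this sketch); the long-term intercepts are then backed out so that every element of the implied mean equals the cross-age average:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 4
B = 0.1 * rng.random((N, N))         # illustrative stable SVAR coefficients (V = 1)
alpha = rng.normal(-0.02, 0.005, N)  # age-specific intercepts

A = np.eye(N) - B                    # I - sum_v B_v

# Stationary mean of the improvements: mu = (I - B)^{-1} alpha.
mu = np.linalg.solve(A, alpha)

# Replace the intercepts so that all elements of the implied long-run
# mean equal the cross-age average mu_bar (age coherence).
mu_bar = mu.mean()
alpha_star = A @ np.full(N, mu_bar)
mu_star = np.linalg.solve(A, alpha_star)
```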
Note that there are three tuning parameters in both SVAR-H and SVAR-G models: the sparsity penalty parameter ( $\lambda_s$ ), the hyperbolic or geometric decay parameter of the first age group ( $d^s_1$ or $r^s_1$ ), and the kernel bandwidth ( ${b^w}$ ). See Section 3.4 for how those parameters are selected.
3.4. Tuning parameters selection
All five age-coherent mortality models described in this section require preselected tuning parameters, either before the model is fitted (STAR, SVAR-H, and SVAR-G) or at the forecasting stage (LC-H, LC-G, SVAR-H, and SVAR-G). Due to the time-series nature of the data, the usual cross-validation technique to select tuning parameters is not directly applicable.
In practice, a popular strategy for time-series data is to adopt the expanding-window approach explained in Hyndman and Athanasopoulos (2018) to collect short-term out-of-sample forecasts. However, since age coherence is a long-term property, we follow Gao and Shi (2021) and adopt the hold-out-sample approach to select tuning parameters in all cases. Specifically, the selection aims to minimize

$\mathrm{RMSFE}=\sqrt{\frac{1}{N|\mathcal{T}|}\sum_{x=1}^{N}\sum_{t\in\mathcal{T}}\left(\ln\hat{m}_{x,t}-\ln m_{x,t}\right)^{2}}, \qquad \mathcal{T}=[3T/4+1,\,T],$
where RMSFE is the root of mean squared forecasting errors, and the evaluation period is given by the last quarter ( $[3T/4+1,T]$ ) of the data in our study. A high-level summary of the selection procedure is listed in Table 1 of the Supplementary Material for each model.
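The hold-out selection can be illustrated schematically with a toy one-parameter example (this is not the model-specific procedure of the Supplementary Material): forecasts built from the first three quarters of a noiseless toy trend are scored on the last quarter, and the grid value minimizing the RMSFE is kept. The shrinkage factor s is a hypothetical stand-in for the decay/bandwidth parameters of the text:

```python
import numpy as np

def rmsfe(actual, pred):
    return np.sqrt(np.mean((actual - pred) ** 2))

# Toy log mortality trend; the last quarter is held out for tuning.
T = 60
y = -6.0 - 0.03 * np.arange(T)  # noiseless, so the best value is known
split = 3 * T // 4              # hold-out sample: [3T/4 + 1, T]
train, hold = y[:split], y[split:]

# One tuning parameter: shrinkage s applied to the estimated drift.
d = np.diff(train).mean()
grid = np.linspace(0.0, 1.0, 21)
h = np.arange(1, len(hold) + 1)
scores = [rmsfe(hold, train[-1] + s * d * h) for s in grid]
best = grid[int(np.argmin(scores))]
```

With a noiseless trend, the full drift (s = 1) reproduces the hold-out sample exactly, so the grid search recovers it.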
4. The proposed ensemble averaging approach
Despite the achieved desirable age coherence among the five investigated models in Section 3, it is usually difficult to pick one single model that uniformly outperforms the rest. This may be attributed to the fact that each model has its merits and drawbacks. For instance, the STAR model does not consider the empirically sensible decline in decaying speed for older ages. In contrast, the cohort effect may be “ignored” in extensions of LC and SVAR. Mortality data are often observed over a long time span and across countries/regions. The cross-sectional (temporal) heterogeneity may favor certain assumptions for some populations (during some periods). In addition, despite the long-term issue of age incoherence, influential models, such as LC, SVAR, APC, HU, and RH, may demonstrate favorable performance for certain age groups/populations in the short term. Consequently, no single model may be well specified to entirely capture the features and dynamics in mortality modeling.
Among the existing literature, ensemble averaging or MA is an effective strategy to combat such uncertainty (see, for example, Ley and Steel, 2009; Amini and Parmeter, 2012; Lessmann et al., 2012; Bork et al., 2020; Bravo et al., 2021, among others). In this section, we propose an effective MA strategy using variance-optimization weights to realize age-coherent forecasts. Technical details are also provided.
4.1. Background: A solution from global minimum variance portfolio
In his seminal work, Markowitz (Reference Markowitz1952) discusses a solution that achieves the global minimum variance for a portfolio, or the GMVP solution. The idea can be applied straightforwardly to forecasting error minimization using an MA model. Suppose that there exists a finite number ( $J>2$ ) of models; the forecasting error of an MA approach is
where $e_{j,t}$ is the forecast error at time t for model j ( $j \in \{1,2,\ldots,J\}$ ), $w_j$ is the weight assigned to the jth model, and ${e}_{M,t}$ is the weighted sum of the forecast errors of all J models, or the forecast error of the MA approach. Denoting $\boldsymbol{{w}}=(w_1,\ldots,w_J)^\prime$ , the GMVP solution of the optimal ${\boldsymbol{{w}}}^*$ is
where $\boldsymbol{\Sigma}_e$ is a $J \times J$ variance-covariance matrix of forecast errors, and $\boldsymbol\iota$ is a $J \times 1$ vector of ones.
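As an illustration (not part of the paper's own implementation), the standard closed-form GMVP solution $\boldsymbol{{w}}^*=\boldsymbol{\Sigma}_e^{-1}\boldsymbol{\iota}/(\boldsymbol{\iota}^\prime\boldsymbol{\Sigma}_e^{-1}\boldsymbol{\iota})$ can be sketched as follows, using a hypothetical three-model forecast-error covariance matrix:

```python
import numpy as np

def gmvp_weights(sigma_e):
    """Closed-form GMVP solution: w* = Sigma^{-1} iota / (iota' Sigma^{-1} iota)."""
    J = sigma_e.shape[0]
    iota = np.ones(J)
    inv_iota = np.linalg.solve(sigma_e, iota)  # Sigma^{-1} iota without explicit inversion
    return inv_iota / (iota @ inv_iota)

# Hypothetical 3-model forecast-error covariance matrix (illustrative values only)
sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
w = gmvp_weights(sigma)
# weights sum to one; models with smaller error variance receive larger weights
```

Note that nothing in this closed form prevents negative entries in $\boldsymbol{{w}}^*$, which motivates the constrained version discussed next.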
Remark 6. Note that when the forecast errors of some models are strongly (and positively) correlated, it is possible to have negative weights in ${\boldsymbol{{w}}}^*$ . In a financial portfolio optimization scenario, this means a short position should be held in some assets. In the mortality case, although negative weights are sensible from a model combination perspective, their interpretation is not as straightforward. Negative weights might also cause instability in long-term forecasting. To avoid this, non-negativity constraints should be added to the optimization problem usually faced in a GMVP setting:
This is a quadratic optimization problem with equality and inequality constraints, which can be solved via iteration-based algorithms (Gill et al., Reference Gill, Murray and Wright2019).
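A minimal sketch of such an iterative solution, using SciPy's SLSQP solver as one possible choice (any quadratic programming routine would do):

```python
import numpy as np
from scipy.optimize import minimize

def gmvp_weights_nonneg(sigma_e):
    """GMVP with w >= 0 and sum(w) = 1, solved iteratively (no closed form)."""
    J = sigma_e.shape[0]
    res = minimize(
        fun=lambda w: w @ sigma_e @ w,           # forecast-error variance of the combination
        x0=np.full(J, 1.0 / J),                  # start from equal weights
        jac=lambda w: 2.0 * sigma_e @ w,
        bounds=[(0.0, None)] * J,                # non-negativity constraints
        constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
        method="SLSQP",
    )
    return res.x

# Hypothetical covariance: models 1 and 2 have strongly positively correlated
# errors, so the unconstrained GMVP would short model 2; here its weight is
# clipped at zero instead.
sigma = np.array([[0.04, 0.058, 0.00],
                  [0.058, 0.09, 0.00],
                  [0.00, 0.00, 0.16]])
w = gmvp_weights_nonneg(sigma)
```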
4.2. An MA approach for mortality forecasting
For an MA model used in mortality forecasting, the most essential issue is the objective function employed in the optimization. This is because the original GMVP solution suffers from two major issues. First, since only the global risk is considered, GMVP may place large weights on a few models. This is also known as the “winner-take-all” problem and may negatively affect diversity in an MA approach. Second, the estimate is highly sensitive to estimation errors in the covariance matrix. Consequently, an inappropriate error covariance matrix may lead to non-robust weights in an MA model.
In this paper, diversity is ensured in two ways. First, we employ both age-coherent and age-incoherent models, which include a total of ten specifications: STAR, LC-H, LC-G, SVAR-H, SVAR-G, LC, SVAR, APC, HU, and RH. Second, different from the single weight for all ages used in existing studies (see, for example, Shang, Reference Shang2012; Kontis et al., Reference Kontis, Bennett, Mathers, Li, Foreman and Ezzati2017; Kessy et al., Reference Kessy, Sherris, Villegas and Ziveyi2021, among others), age-specific weights are employed in all cases. As preliminarily evidenced in Figure 1(b), not all models perform uniformly well for all age groups. This setting effectively prevents the existence of one or a few dominant models.
To avoid potential overfitting to in-sample errors, we construct $\Sigma_{e}$ using out-of-sample errors only. Specifically, the objective function consists of out-of-sample mean squared forecasting errors (MSFE). Since it is well known that MSFE incorporates both process risk and parameter risk, this effectively addresses the second problem of GMVP. In addition, we incorporate two pieces of prior information into the optimization to further improve the reasonableness of mortality forecasting. First, as argued above, the MA approach should (asymptotically) produce age-coherent forecasts. Therefore, the weights assigned to age-incoherent models should asymptotically approach zero; a penalty is imposed in the objective function for this aim. Second, mortality rates across neighboring ages tend to demonstrate similar dynamics. Without further constraints, however, abrupt changes in forecast rates would be unavoidable, even for neighboring ages, due to the randomness introduced by small samples. Consequently, inspired by existing studies such as Li and Lu (Reference Li and Lu2017), a smoothness penalty is additionally introduced into the objective function.
The optimization problem to choose optimal weights of an MA mortality model is then stated below. Note that for a mortality data set with N age groups estimated by J models (with the total $J_c$ age-coherent models ordered first and the rest $J_n$ age-incoherent models ordered last), the weight vector to be estimated is ${\boldsymbol{{w}}}=(\boldsymbol{{w}}^\prime_1,\ldots,\boldsymbol{{w}}^\prime_N)^\prime$ (the dimensionality is $(NJ)\times 1$ ), with $\boldsymbol{{w}}_x=(w_{x,1},\ldots,w_{x,J})^\prime$ . To further allow for the age coherence penalty, smoothness penalty and non-negative weights, we have the following optimization problem:
where $\boldsymbol{{\Omega}}_{e}$ is an $(NJ)\times(NJ)$ matrix with $\boldsymbol{\Sigma}_{e_x}$ on the diagonal blocks and elsewhere 0. $\boldsymbol{{C}}_1$ and $\boldsymbol{{C}}_2$ are both $(NJ)\times(NJ)$ ancillary matrices to specify the quadratic age coherence penalty and smoothness penalty, respectively. $\lambda_1$ and $\lambda_2$ are the associated tuning parameters. $\boldsymbol{\mathcal{I}}$ is another $(NJ)\times(NJ)$ ancillary matrix to impose the equality constraint for each age. Specifically, $\boldsymbol{{C}}_1$ is a diagonal matrix with the following diagonal items:
The xjth column of $\boldsymbol{{C}}_2$ is
That is, there are $(N-1)$ (−1)’s and one 1 in each column, with all other elements being 0. An example is provided below to illustrate the specification when $N=2$ , $J_c=1$ and $J_n=1$ . Also,
where $\boldsymbol{\iota}$ is a $J \times 1$ vector consisting of 1, as defined previously. The $\boldsymbol{{w}}^*$ can then be obtained using a standard quadratic programming algorithm, with preselected $\lambda_1$ and $\lambda_2$ (Goldfarb and Idnani, Reference Goldfarb and Idnani1983). Note that the coherence penalty term is
which reduces the weights for the $J_n$ age-incoherent models with the influence of penalty $\lambda_1$ , and the smoothness penalty term is
which increases the smoothness of assigned weights for the same model between adjacent ages for a larger $\lambda_2$ .
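To illustrate the two penalty terms, the following sketch builds the diagonal coherence matrix $\boldsymbol{{C}}_1$ as defined above and, as a stand-in for $\boldsymbol{{C}}_2$, an adjacent-age first-difference operator D whose quadratic form matches the smoothness penalty described in the text (the paper's exact column construction of $\boldsymbol{{C}}_2$ may differ):

```python
import numpy as np

def coherence_matrix(N, J, J_c):
    """Diagonal C1: 1 for the J_n age-incoherent models (ordered last), else 0."""
    block = np.r_[np.zeros(J_c), np.ones(J - J_c)]
    return np.diag(np.tile(block, N))

def difference_matrix(N, J):
    """D such that (D w) stacks w_{x+1,j} - w_{x,j} over adjacent ages, per model."""
    D = np.zeros(((N - 1) * J, N * J))
    for x in range(N - 1):
        for j in range(J):
            D[x * J + j, x * J + j] = -1.0
            D[x * J + j, (x + 1) * J + j] = 1.0
    return D

# N=2 ages, J=2 models (J_c=1 age-coherent ordered first, J_n=1 age-incoherent last)
N, J, J_c = 2, 2, 1
C1 = coherence_matrix(N, J, J_c)
D = difference_matrix(N, J)
w = np.array([0.7, 0.3, 0.6, 0.4])     # (w_{1,1}, w_{1,2}, w_{2,1}, w_{2,2})

coh = w @ C1 @ w                       # w_{1,2}^2 + w_{2,2}^2 = 0.09 + 0.16
smooth = w @ (D.T @ D) @ w             # (0.6-0.7)^2 + (0.4-0.3)^2 = 0.02
```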
Example 1. Consider a simple case with $N=2$ ages, $J_c=1$ and $J_n=1$ models; we then have
The optimization problem stated in (4.1) then reduces to
With a larger (smaller) $\lambda_1$ , weights of the second model (assumed age-incoherent) will be smaller (larger). Similarly, with a larger (smaller) $\lambda_2$ , the $\hat{w}_{1,1}$ and $\hat{w}_{1,2}$ will change to $\hat{w}_{2,1}$ and $\hat{w}_{2,2}$ more (less) smoothly.
Denoting the h-step-ahead forecast of age x’s logged mortality rate by the jth model as $\hat{y}_{x,j,T+h}$ , the MA forecast is therefore
We now demonstrate that this forecast also achieves the desirable age coherence in an asymptotic fashion.
Theorem 1. Assume that the estimated long-run mean of mortality improvements is not increasing with h for all models considered in the MA approach, that the estimators of the age-coherent models are all asymptotically consistent, that h and T go to infinity at the same rate, and that the penalty $\lambda_1$ grows with $T^2$ . Forecasts of the MA approach using the estimated weights solving (4.1) are then asymptotically age coherent.
Proof. See Supplement Materials A.
Finally, the selection of $\lambda_1$ and $\lambda_2$ can still be performed using the hold-out-sample strategy as explained in Section 3.4. The detailed steps to forecast logged mortality rates using the proposed MA approach are listed below.
1. With the training period $[1,\ldots,T/2]$ , fit each of the J models included in the ensemble;
2. Collect out-of-sample forecast errors over $[T/2+1,\ldots,3T/4]$ , and calculate the sample estimate $\widehat{\boldsymbol{\Sigma}}_{e}$ ;
3. Perform a grid search over $\lambda_1$ and $\lambda_2$ : for each candidate pair, obtain the weight estimates as in (4.1) using $\widehat{\boldsymbol{\Sigma}}_{e}$ from Step 2, fit all individual models over the full training sample $[1,\ldots,3T/4]$ , produce forecasts over the test period $[3T/4+1,\ldots,T]$ , and calculate the corresponding RMSFE of the MA model using the estimated weights and the individual models’ forecasts as in (3.10);
4. Select the optimal $\lambda_1$ and $\lambda_2$ as those that minimize the RMSFE;
5. With the selected tuning parameters, repeat Steps 1–2, using the training period $[1,\ldots,3T/4]$ and $\widehat{\boldsymbol{\Sigma}}_{e}$ estimated with out-of-sample forecast errors over $[3T/4+1,\ldots,T]$ , to obtain the final estimated weights; and
6. Conduct the out-of-sample forecasting ( $[T+1,\ldots,T+h]$ ) with each of the J models; the MA forecast rate is their weighted sum using the weights estimated in Step 5.
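Steps 1–4 above can be sketched as follows, where `fit_and_forecast` and `solve_weights` are hypothetical placeholders standing in for the individual mortality models and the QP in (4.1), respectively:

```python
import numpy as np

def tune_lambdas(y, models, lambda_grid, solve_weights, fit_and_forecast):
    """Hold-out tuning loop: y is an (N ages x T years) array of log rates.

    `fit_and_forecast(model, train, horizon)` and `solve_weights(errs, l1, l2)`
    are hypothetical callables supplied by the user; this is a sketch of the
    procedure, not the paper's implementation.
    """
    T = y.shape[1]
    t1, t2 = T // 2, 3 * T // 4
    # Steps 1-2: out-of-sample errors over [T/2+1, 3T/4]
    errs = {m: fit_and_forecast(m, y[:, :t1], t2 - t1) - y[:, t1:t2]
            for m in models}
    # Step 3: refit on [1, 3T/4], then grid-search on the test period [3T/4+1, T]
    fcsts = {m: fit_and_forecast(m, y[:, :t2], T - t2) for m in models}
    best, best_rmsfe = None, np.inf
    for lam1, lam2 in lambda_grid:
        w = solve_weights(errs, lam1, lam2)          # age-specific weights per model
        ma = sum(w[m][:, None] * fcsts[m] for m in models)
        rmsfe = np.sqrt(np.mean((ma - y[:, t2:]) ** 2))
        if rmsfe < best_rmsfe:                       # Step 4: keep the minimizer
            best, best_rmsfe = (lam1, lam2), rmsfe
    return best
```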
To study the uncertainty of the forecasts, we may follow the usual bootstrap method as explained in Chang and Shi (Reference Chang and Shi2021). Simply speaking, in-sample errors are bootstrapped to produce replicates, which are then fitted following the procedure explained above to produce new forecast rates. The 2.5th and 97.5th percentiles of those forecasts then form the 95% prediction interval (PI).
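A minimal sketch of this percentile bootstrap, with a hypothetical `refit_forecast` callable standing in for the full refit-and-forecast procedure:

```python
import numpy as np

def bootstrap_interval(refit_forecast, residuals, fitted, n_boot=1000, seed=0):
    """Percentile bootstrap for a 95% prediction interval.

    `refit_forecast` is a hypothetical callable that refits the model on a
    bootstrap replicate of the data and returns a forecast path; `residuals`
    and `fitted` are the in-sample errors and fitted values.
    """
    rng = np.random.default_rng(seed)
    paths = []
    for _ in range(n_boot):
        resampled = rng.choice(residuals, size=residuals.shape, replace=True)
        paths.append(refit_forecast(fitted + resampled))   # replicated data set
    paths = np.asarray(paths)
    return np.percentile(paths, 2.5, axis=0), np.percentile(paths, 97.5, axis=0)
```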
Remark 7. The advantages of the proposed MA approach are fourfold. First, different from a “naive” simple average approach, the weights are selected in a way that follows the spirit of GMVP and thus honors the diversity benefits. Second, the employed objective function effectively resolves the two major issues of the GMVP solution. In particular, as forecast errors consist of $y_{x,T+h}-\hat{y}_{x,j,T+h}$ , the squared loss (variance) to be minimized is composed of out-of-sample MSFE, which considers the parameter risk that GMVP neglects. In addition, the imposed penalties ensure asymptotic age coherence in the long run and simultaneously reduce the risk of overfitting in the short run. Specifically, the employed hold-out-sample strategy to choose the smoothness penalty effectively eliminates undesirable abrupt changes in forecast rates (a signal of overfitting) from one age to another. This may also improve the consistency of the weight estimation, since potential bias can be effectively reduced by minimizing RMSFE over the test sample; a bias adjustment factor is thus not needed in (4.1). Third, the imposed non-negativity constraint improves interpretability. In short, the proposed MA approach considers both in-sample fitting and out-of-sample forecasting accuracy, as well as interpretability. Also, the modeling mechanism of our approach is transparent. As a supervised learning technique, this may be considered an advantage of our approach over other competitive but more “black-box” machine learning models, such as neural networks. Fourth, the curse of high dimensionality is avoided by working with mortality rates of individual ages. To see this, although $\boldsymbol{{\Omega}}_{e}$ in (4.1) is high-dimensional, it is a sparse matrix with nonzeros only on the diagonal blocks. Despite the large value of N, each $\boldsymbol{\Sigma}_{e_x}$ only considers J models ( $J \times J$ ). Since it usually holds that $J << T$ , $\boldsymbol{\Sigma}_{e_x}$ can be reliably estimated.
5. Empirical out-of-sample forecasting results
In this study, we focus on the mortality data of the ten European countries investigated in the seminal work of Li and Lee (Reference Li and Lee2005): Austria, Denmark, the United Kingdom (UK), Finland, France, Italy, Netherlands, Norway, Spain, and Switzerland. All data are obtained from the Human Mortality Database (2019). Following Booth et al. (Reference Booth, Hyndman, Tickle and De Jong2006), we choose an opportune range of data, 1950–2016, to obtain a reliable and complete data set. Similarly, ages from 0 to 110 (the upper limit in Human Mortality Database, 2019) years are included in the sample, where data at older ages (100 and above) are grouped to avoid potentially erratic rates. The crude total (unisex) mortality rates are studied.
To illustrate the power of our proposed model, we consider the training sample of 1950–2006 and forecast the mortality rates for 2007–2016 as the baseline results. Out-of-sample forecasts of the eleven investigated models: STAR, LC-H, LC-G, SVAR-H, SVAR-G, LC, SVAR, APC, HU, RH, and MA are presented and compared. Next, we conduct three sets of relevant sensitivity analyses, including robustness checks on the forecasting horizon, sample period, and age groups. Additional analyses with the model confidence set (MCS) and Bayesian modeling average (BMA) are conducted at the end of this section.Footnote 2
5.1. Baseline results
To compare the forecasting performance across all models, we follow existing studies such as Li and Lu (Reference Li and Lu2017) and employ the RMSFE examining all age groups and forecasting steps:
In the baseline case, we select $h=10$ , such that the training sample of 1950–2006 is fitted, and out-of-sample forecasts are produced over 2007–2016 for each population. The ten individual models: STAR, LC-H, LC-G, SVAR-H, SVAR-G, LC, SVAR, APC, HU, and RH are first estimated, following the tuning parameter selection procedures described in Section 3.4. Weights of those models are then chosen following the steps listed in Section 4.2. All forecasts of each of the eleven models are then collected to compute their corresponding RMSFEs. This procedure is executed for all ten examined populations.
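For concreteness, the RMSFE criterion over all ages and forecast steps (and its age-specific counterpart) can be computed as follows; this is an illustrative sketch of the standard definitions, not code from the study:

```python
import numpy as np

def rmsfe(actual, forecast):
    """RMSFE over all N age groups and h forecast steps (both arrays are N x h)."""
    return float(np.sqrt(np.mean((np.asarray(actual) - np.asarray(forecast)) ** 2)))

def rmsfe_by_age(actual, forecast):
    """Age-specific RMSFE: root mean squared error along the horizon, per age."""
    return np.sqrt(np.mean((np.asarray(actual) - np.asarray(forecast)) ** 2, axis=1))
```

The second function corresponds to the age-dimension RMSFE used for the Italian case study in Section 6.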
All specific RMSFEs are presented in Table 1. To facilitate the comparison, the ranking of each model is displayed in Figure 2, where a darker color indicates a higher ranking. On average, it can be seen that the five age-coherent models beat the five age-incoherent counterparts. More importantly, contrasting the six age-coherent sets of forecasts, those of MA outperform the individual models in eight out of the ten populations and are the second or third best for the rest. This strongly supports the dominant forecasting accuracy of our proposed MA technique. As for the specific metrics, on average, the RMSFE of MA is 0.1707, which is almost 25% smaller than that of LC (0.2360). A summary of those baseline RMSFEs can be found in Table 2, which is presented when the MCS analysis is performed.
We now compare the individual models included in the ensemble. On average, STAR produces the lowest RMSFE, and the results of the other four age-coherent models are relatively close to each other. Specifically, the average ranking of STAR is 3.6, and those of LC-H, SVAR-H, and SVAR-G are close to 5, whereas LC-G ranks somewhat lower at 7. Further, the RMSFEs of SVAR-H and SVAR-G are very similar throughout all cases. Those of LC-H and LC-G are also not far from each other, although the difference is visually larger than that between SVAR-H and SVAR-G. For the five age-incoherent models, SVAR and HU result in similar overall performance, whereas LC, APC, and RH rank at the end of all considered models.
5.2. Sensitivity analyses
To check the robustness of baseline results, we consider three sets of sensitivity analyses individually:
-
1. Forecast steps are increased to $h=16$ , as examined in Gao and Shi (Reference Gao and Shi2021);
-
2. The starting year is truncated to 1960; and
-
3. The age range is reduced to 0–89.
The ranking of all models examined in each set is plotted in Figure 3(a)–(c), and the corresponding RMSFEs are summarized in Table 2 in the Supplement Materials. The ranking plot suggests that our proposed MA approach works consistently well when the forecasting horizon, sample period, and age groups are altered individually. It almost uniformly ranks among the top 3 models and produces the smallest average RMSFE in all scenarios. Nevertheless, it is worth noting that the STAR model ranks second best in all cases, whereas LC, APC, and RH are the three least preferred models. The overall performance of the five age-coherent models is uniformly better than that of the five age-incoherent models, by averaging the RMSFEs across relevant models and populations.
In summary, we employ the popular criterion RMSFE and demonstrate that the proposed MA approach can overall improve the forecasting performance of all models included in the ensemble. This conclusion is robust when various sensitivity analyses are conducted with individual adjustment being made.
5.3. Comparison with the model confidence set and Bayesian modeling average approaches
In their seminal work, Hansen et al. (Reference Hansen, Lunde and Nason2011) propose an MCS approach that chooses the subset of superior models out of a range of candidates. Those superior models are identified by testing whether they can be assumed to present equal predictive ability at a given confidence level. Such a test may be performed for any loss function, including the popular squared losses employed in this paper. Briefly speaking, the MCS test works on bootstrap samples sequentially, eliminating the worst model at each stage. The procedure stops when the null hypothesis of equal predictive ability among the remaining models cannot be rejected.
To employ the MCS for mortality forecasting, we produce squared forecast errors over the last quarter of the training period (i.e., 1993–2006) for each individual model. Similar to Kessy et al. (Reference Kessy, Sherris, Villegas and Ziveyi2021), we choose a confidence level of 10%, and the selected models for each population are presented in Table 2. It is interesting to note that LC-H and LC-G are the two most frequently selected models in the MCS, followed by STAR. On the other hand, the LC, APC, and RH models are not selected for any of the ten populations.
Assuming equal weights for all ages and selected models, we construct this new MA approach via the MCS method. Together with the baseline results, the RMSFEs are summarized in Table 3, and individual ranks are illustrated in Figure 3(d). Clearly, our proposed MA approach beats the MCS according to all five metrics presented in Table 3. For the RMSFE of individual populations, our approach results in more accurate forecasts in seven out of ten countries. Additionally, given that MCS significantly reduces the number of selected models in the ensemble, we conclude that our approach is preferred for both the achieved diversity and the resulting forecasting accuracy.
Note: bold numbers are the smallest quantity for each statistic across the twelve models.
As an alternative to the GMVP and MCS approaches, another potentially useful model averaging approach is the Bayesian modeling average (BMA). Specifically, the weights of the BMA are essentially posterior model probabilities. Detailed calculations of those weights can be found in Raftery et al. (Reference Raftery, Madigan and Hoeting1997), and a full BMA usually requires a computationally intensive approach, such as Markov chain Monte Carlo. Fortunately, in practice, seminal works including Bates and Granger (Reference Bates and Granger1969) have suggested that the weights of a full BMA can be well approximated by employing a much simpler Akaike or Bayesian information criterion. Applications of such an approximation to mortality data can be found in Shang (Reference Shang2012). However, Wagenmakers and Farrell (Reference Wagenmakers and Farrell2004) point out that if one model in the ensemble is dominant according to the information criterion used, the posterior probability of a BMA may tend to be close to one, which is also known as the “winner-take-all” issue. Consequently, we consider an alternative BMA technique discussed in Kontis et al. (Reference Kontis, Bennett, Mathers, Li, Foreman and Ezzati2017), and age-specific weights are approximated by
where $PB_{x,j}=\sum_{i=1}^{h}(\hat{y}_{x,j,T+i}-y_{x,j,T+i})/h$ is the projection bias for the logged mortality rates of age x produced by model j. Intuitively, due to the closeness of mortality forecasts, no $PB_{x,j}$ is likely to be so uniformly small or close to 0 as to cause a dominant weight in a BMA mortality model.
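The projection bias follows directly from the definition above. The transform from biases into weights is not reproduced in the text, so the `bma_weights` normalization below is a hypothetical stand-in (smaller $|PB|$ yields a larger weight), not the exact formula of Kontis et al.:

```python
import numpy as np

def projection_bias(y_hat, y):
    """PB_{x,j}: mean signed forecast error over the horizon for one age/model."""
    return float(np.mean(np.asarray(y_hat) - np.asarray(y)))

def bma_weights(pbs):
    """Hypothetical normalization: weights inversely proportional to |PB|."""
    inv = 1.0 / np.abs(np.asarray(pbs, dtype=float))
    return inv / inv.sum()
```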
Using the age-specific BMA weights of all models, the RMSFE results are summarized in the last row of Table 3. Overall, the results of BMA outperform all but those of our MA approach. Two conclusions can be drawn. First, BMA forecasts can serve as an additional robustness check on our baseline results, with the altered factor being the weighting strategy in the model averaging. Second, the overall outperformance of MA over BMA supports the superior effectiveness of our proposed approach. Although no dominant weights may exist in the employed BMA, such an approach is “deterministic” and formula based. Thus, it cannot accommodate the coherence and smoothness penalties. Consequently, the desirable asymptotic age coherence is lost for the resulting forecasts, and abrupt changes in mortality forecasts are inevitable, even for neighboring age groups. This might also contribute to its inferior forecasting performance relative to our MA approach.
6. A case study and illustration on annuity pricing
In this section, we comprehensively present a case study investigating the Italian population using the examined models. The curves presented in Figure 1 visualize the dynamics of the associated mortality rates. To illustrate the practical usefulness of the proposed MA approach, we demonstrate the age-specific forecasting results and the long-term forecasting results. We also validate the proposed objective function of the MA method by contrasting relevant results with those of BMA. An application to annuity pricing can be found at the end of this section.
6.1. Forecasting results
Instead of examining a single metric, we employ the RMSFE over the age dimension to compare forecasting performance of each model. Consistent with our baseline setting discussed in Section 5.1, the training period of 1950–2006 is employed to fit in-sample estimates, and out-of-sample forecasts are calculated for the test period of 2007–2016.Footnote 3 At each of the 101 one-year age groups (0–100), we calculate the age-specific RMSFE as
The results are summarized in Table 4.
Note: bold numbers are the smallest quantity for each statistic across all models.
In Table 4, it can be seen that the proposed MA approach still leads to the smallest average RMSFE over ages, with the narrowest variation. Specifically, the mean (standard deviation) of MA’s RMSFE is 0.0859 (0.0604), over 50% (80%) smaller than that of LC. Thus, for the Italian population, MA can improve the forecasting accuracy of the individual models in the ensemble.
We explore out-of-sample interval estimates in Figure 4, where the data investigated are the logged mortality rates averaged over all ages. The results of MA and SVAR-H, the two best models as evidenced in Table 4, are contrasted, and the 95% PIs of MA and SVAR-H are constructed using bootstrap replicates as outlined in Section 4.2. Clearly, compared to the mean estimates of SVAR-H, those of MA are overall closer to the true values spanning 2007–2016. This is consistent with our observations in Table 4. Also, although both PIs manage to cover the true values at all forecasting steps, the width of MA’s PI is uniformly narrower than that of SVAR-H. This implies that MA is potentially more efficient than the SVAR-H model. Such a finding is consistent with our discussion in Section 3. Specifically, MA with weights estimated via the forecast error optimization may reduce forecasting uncertainty relative to all individual models.
Finally, we display the long-term forecasts as of 2051, a 35-step-ahead horizon, in Figure 5. The full data of 1950–2016 are fitted to provide the parameter estimates, and forecasts of MA and LC are generated and contrasted. Compared to the true values of 2016, the forecasts of LC in 2051 demonstrate little improvement at both the very old ages and the “accidental hump” (ages 20–25). Such results contradict the observations in Figure 1(a), which suggest significant improvements even for those ages. In contrast, due to its age-coherence feature, the forecasts of MA show more consistent improvements over all ages. Also, even over a considerably long period of 35 years, no abrupt changes are observed at any age for the MA model. This supports the effectiveness of the smoothness penalty $\lambda_2$ employed in (4.1).
6.2. Validation of the proposed weighting method
In Figure 1(b), we present the $RMSFE_x$ of the age-incoherent models over the out-of-sample period 1993–2006. The corresponding forecast errors are the core part of the loss minimized in (4.1). Consistent with this, we present the $RMSFE_x$ of the age-coherent models over the same period in Figure 6. To facilitate the comparison, the RMSFEs of the five age-incoherent models are grouped together, and their average value is also plotted (denoted as NC). Summarizing the two figures, there are two major findings. First, as stated in Section 1, no single model can uniformly outperform the rest across all age groups. This validates the necessity of the age-specific weights employed in our MA approach. Second, the $RMSFE_x$ of the age-incoherent models is overall higher than that of the age-coherent models. Specifically, the age-incoherent $RMSFE_x$ is quite high over young ages 0–25 and is still among the worst for ages beyond 50. This suggests that with the imposed coherence penalty $\lambda_1$ in (4.1), the estimated weights of the age-incoherent models may be much lower than those of the age-coherent models.
In Figure 7(a), we plot the final estimated weights of the MA model. The dynamics are highly consistent with our findings in Figure 6. First, the total weights of the age-incoherent models are small, due to their unfavorable RMSFE across ages discussed above. Also, the fitted weights demonstrate patterns consistent with their $RMSFE_x$ . For instance, the weights of the SVAR-H and SVAR-G models increase over the oldest age groups, reflecting their smallest RMSFEs at those ages. Additionally, all weights vary smoothly across ages, due to the impact of the smoothness penalty $\lambda_2$ . More importantly, it is verified that no single model dominates in the identified age-specific weights, which supports the diversity realized by our MA approach. Last but not least, since the SVAR-H and SVAR-G models produce very similar $RMSFE_x$ at all ages, their corresponding weights are close to each other.
We now contrast the MA results with those of BMA. Recall that in Section 5.3, an issue was identified for BMA, such that abrupt changes are unavoidable, due to its formula-based nature. To see this, we plot the estimated weights of BMA in Figure 7(a). Clearly, those weights are identical to the proportions that the $RMSFE_x$ of an individual model takes in the total RMSFEs at the same age, as shown in Figures 1(b) and 6. Consequently, erratic dynamics are observed for all fitted weights, and sharp changes appear at multiple ages. For instance, the weight of the LC model in BMA drops from 0.1 to 0.05 from age 0 to 1, and then bounces back to 0.09 at age 2. Further, despite their overall unfavorable performance, age-incoherent models are still assigned considerably large weights. Averaging over all ages, the weights of each model are close to 0.1, the value that would have been assumed under a simple average method. As for forecasting accuracy, the $RMSFE_x$ estimated by the BMA approach is reported in Table 4; the results are outperformed by the proposed MA counterparts.
In conclusion, to realize the coherence and smoothness penalties, a squared loss on the weights needs to be implemented in the objective function. Consequently, the GMVP structure as adopted in (4.1) is a simple but effective approach, compared to other alternatives such as the BMA.Footnote 4 Besides, the inability to incorporate those penalties is also faced by the novel approach proposed in Kessy et al. (Reference Kessy, Sherris, Villegas and Ziveyi2021). Although this stacked regression employs an Elastic Net (ENET) loss, the inclusion of the smoothness penalty is not straightforward. This is because the smoothness penalty is not a standard L2-type loss, and a more complicated computational approach would need to be implemented (Chang and Shi, Reference Chang and Shi2021). The imposed equality and inequality constraints would further increase the computational cost.
6.3. A financial application: Fixed-term annuity pricing
To demonstrate the practical usefulness of the MA approach, a financial application is presented in this section. Following Shi (Reference Shi2022b), we consider the pricing of fixed-term annuities using the MA technique. This product has attracted a growing number of policyholders worldwide, especially those planning for their retirements. Compared to lifetime annuities, fixed-term products pay a higher predetermined and guaranteed income with deferrable options (Shang and Haberman, Reference Shang and Haberman2017). A cohort approach, as adopted in Shi (Reference Shi2022b), is employed here to price annuities. To be consistent with our previous analyses, the maximal survival age is limited to 100.
The specific pricing scheme is explained below. First, the $\tau$ year survival probability of a person aged x at $t = 0$ is
where ${}_{1}p_{x+j-1} = e^{-m_{x+j-1}}$ and $m_{x+j-1}$ can be obtained from mortality forecasts. Hence, an annuity with a T-year maturity, written for an x-year-old person and paying a benefit of $1 per year conditional on survival, is priced as
where $I(\cdot)$ is the indicator function, $T_x$ is the survival time, and $P_B(0,\tau)$ is the price of a $\tau$ -year bond with the interest rate set to the yield of the annuity. Since $E(I(T_x>\tau)|m_{x,{1:T}})={_\tau}p_x(m_{x,{1:T}})$ , the fixed-term annuity price is fully determined by the underlying yield and mortality rates. Thus, to accurately price the product, it is critical to provide precise forecasts of $m_{x,{1:T}}$ , potentially over a long period. That is, the mortality experience of policyholders needs to be accurately projected to minimize mispricing risks.
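Under the pricing scheme above and a constant interest rate (so that $P_B(0,\tau)=(1+r)^{-\tau}$, consistent with the flat 3% rate assumed later in this section), the annuity price can be sketched as follows; the flat mortality path in the usage line is purely hypothetical:

```python
import numpy as np

def annuity_price(m_path, r, T):
    """Fixed-term annuity EPV: sum over tau = 1..T of (1+r)^{-tau} * tau_p_x.

    `m_path[j]` is the forecast mortality rate m_{x+j} faced in year j+1, so
    the tau-year survival probability is exp(-(m_x + ... + m_{x+tau-1})).
    """
    m = np.asarray(m_path[:T], dtype=float)
    surv = np.exp(-np.cumsum(m))              # tau_p_x for tau = 1..T
    disc = (1.0 + r) ** -np.arange(1, T + 1)  # constant-rate bond prices P_B(0, tau)
    return float(np.sum(disc * surv))

# Hypothetical flat mortality path for illustration: 3% interest, 10-year term
price = annuity_price(np.full(10, 0.02), r=0.03, T=10)
```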
For illustration purposes, we employ up to 35-step-ahead forecasts (from 2017 to 2051) of the proposed MA model. The estimated prices of fixed-term annuities as of 2016 are presented in Table 5. The calculation is conducted for the Italian population up to the 30-year maturity, starting from age 65. We follow Shi (Reference Shi2022b) and present results for four ages: 65, 70, 75 and 80. Both point and 95% interval estimates are reported. For simplicity, a constant interest rate of 3% is assumed in all cases. Thus, the only source of uncertainty is the precision of the mortality forecasting. As evidenced in Figure 4, the proposed MA technique is more efficient in providing PIs, which supports the small widths observed in Table 5. As argued in Fung et al. (Reference Fung, Peters and Shevchenko2015), underpricing annuities by as little as 0.1% can lead to a dramatic shortfall in reserving for a large portfolio. Consequently, precisely and efficiently determining the uncertainty of the premium rate is critical for insurers to optimize their reserves, such that the associated ruin probability is minimized.
Note: this table displays the forecast fixed-term annuity prices for the average population as of 2016. The forecast mortality rates range from 2017 to 2051. LB and UB stand for the 2.5th and 97.5th percentiles of the prediction interval, respectively. Mean is the point/mean forecast price. T is the maturity term. Values in brackets are the percentage differences compared to the forecast mean annuity price. We only consider contracts with maturity such that age + maturity $\le$ 100.
Specifically, a prospective lifetable constructed using the mortality forecast by the MA approach may largely reduce the longevity risks. As demonstrated in Figure 7, an age-incoherent model like LC may underestimate mortality improvements of old ages. For pension products, a corresponding prospective lifetable will then result in an underestimated reserve and thus increase the ruin probability. Using an age-coherent model like the MA approach to produce such a prospective lifetable will help address this issue.
7. Concluding remarks
In this study, we propose an effective ensemble or model averaging (MA) approach to study and forecast logged mortality rates. Three key results can be drawn from our research. First, the proposed MA model is effective in mortality forecasting, improving upon the accuracy of all individual models in the ensemble. Altogether, we examine five age-coherent candidates: the spatial–temporal autoregressive (STAR) model proposed in Li and Lu (Reference Li and Lu2017), the Lee–Carter model with hyperbolic (LC-H) and geometric (LC-G) decays studied in Gao and Shi (Reference Gao and Shi2021), the sparse VAR model with hyperbolic decay (SVAR-H) examined in Li and Shi (Reference Li and Shi2021) and the SVAR model with geometric decay (SVAR-G). Five age-incoherent models are also considered, including the LC model (Lee and Carter, Reference Lee and Carter1992), SVAR model (Guibert et al., Reference Guibert, Lopez and Piette2019), age-period-cohort model (Cairns et al., Reference Cairns, Blake, Dowd, Coughlan, Epstein, Ong and Balevich2009), functional demographic model (Hyndman and Ullah, Reference Hyndman and Ullah2007) and Renshaw–Haberman model (Renshaw and Haberman, Reference Renshaw and Haberman2006). Our data cover ages 0–100 of the ten European populations considered in Li and Lee (Reference Li and Lee2005), spanning 1950–2016. Robust results are also observed when the forecasting steps, sample period and covered ages are altered individually. As an alternative model averaging approach, we also explore the model confidence set (Hansen et al., Reference Hansen, Lunde and Nason2011) for mortality forecasting. The improvements in forecasting with our proposed MA approach are consistently demonstrated in all cases.
Second, the technical properties of MA are presented and discussed. The weights in MA are determined by minimizing the resulting variance of out-of-sample errors under non-negativity constraints to improve interpretability. Working with out-of-sample errors resolves the major issue of the underlying global minimum variance portfolio approach, namely that parameter risk is not considered. Moreover, since the weights are age specific and the ensemble space is considerably large, diversity is realized to prevent a potential "winner-take-all" issue. To further avoid overfitting and abrupt changes of forecast rates over adjacent ages, a smoothness penalty is imposed. Also, we employ a coherent tuning parameter to achieve the desirable age coherence asymptotically. The resulting forecasts of our MA method are then proven to be asymptotically age coherent. Owing to its averaging nature, the MA approach can resolve the misspecification issue that each individual model may face. This explains its attractive forecasting performance in the empirical analyses.
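The weighting scheme described above can be illustrated with a minimal numerical sketch. The function below minimizes the variance of the combined out-of-sample errors over non-negative weights summing to one, with a ridge-type smoothness penalty pulling an age's weights towards those of the adjacent age. The solver (exponentiated gradient descent on the simplex) and all names are our own illustrative choices, not the paper's exact estimator for (4.1):

```python
import numpy as np

def ma_weights(errors, w_prev=None, lam=0.1, lr=0.5, n_iter=500):
    """Sketch: minimise w' S w + lam * ||w - w_prev||^2 over the
    probability simplex (w >= 0, sum w = 1), where S is the covariance
    of the models' out-of-sample errors and w_prev holds the weights
    of the adjacent age (the smoothness target)."""
    errors = np.asarray(errors)      # shape: (n_models, n_periods)
    S = np.cov(errors)               # error covariance across models
    n = S.shape[0]
    if w_prev is None:
        w_prev = np.full(n, 1.0 / n)
    w = np.full(n, 1.0 / n)          # start from equal weights
    for _ in range(n_iter):
        grad = 2.0 * S @ w + 2.0 * lam * (w - w_prev)
        w = w * np.exp(-lr * grad)   # multiplicative update keeps w >= 0
        w /= w.sum()                 # renormalise onto the simplex
    return w
```

With `lam = 0` this reduces to the global minimum variance combination; increasing `lam` trades in-sample variance for smoothness of the weights across adjacent ages.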
Finally, MA can be a useful tool in practical applications. To demonstrate this, we present a case study on the Italian population. In particular, the interval results indicate that the MA approach may result in lower forecasting uncertainty than all individual models. Moreover, through an analysis of the estimated age-specific weights, we show that the proposed MA approach is more appropriate than popular alternatives, such as the BMA. A financial application to the pricing of fixed-term annuities demonstrates its usefulness in practice. For instance, practitioners like insurers may benefit from adopting the MA technique to improve reserving accuracy and thus reduce the ruin probability.
Apart from the improved forecasting accuracy, the proposed MA approach can shed light on two areas of actuarial practice. First, as plotted in Figure 7(a), the age-specific weights illustrate the relative strength of a model for certain ages. For the Italian population, the LC-type age-coherent models are preferred for the workforce age groups (30–65), whereas the SVAR-type counterparts take most of the weight at the oldest ages (90–100). This might suggest that mortality rates of workforce groups are less prone to cohort effects, whereas the oldest ages are associated with more cross-group impacts. Consequently, decisions on risk factors, such as the birth year, may have varying impacts across future ages (some could be insignificant, if the age falls within 30–65) when pricing the premiums of the associated pension products. Second, the proposed objective function presents a simple but effective approach to handling modeling errors. Specifically, due to the small sample size, estimation randomness could be high without constraints, reducing the reliability of the estimates. Apart from adopting out-of-sample forecast errors, our approach employs the coherent and smoothness penalties. Those constraints reflect prior information and effectively reduce the influence of the randomness introduced by small samples. Implications can be drawn for computing the solvency capital requirement (SCR) in the European Union. For instance, to improve reserving efficiency, the SCR may honor appropriate prior information adopted to reduce modeling error in the internal models employed by an insurer.
There are also some pathways for future research to extend the technical features of the proposed model. For instance, robust optimization techniques (Bertsimas et al., Reference Bertsimas, Brown and Caramanis2011) may be implemented when solving (4.1) for the optimal weights. This may further improve the robustness of the fitted MA, as the sample of out-of-sample forecast errors is usually limited. Also, as pointed out in Brouhns et al. (Reference Brouhns, Denuit and Vermunt2002), mortality models based on logged mortality rates frequently perform better for young ages and worse for old ages, compared to those on the original scale. The reason is that the logarithm of the observed force of mortality is much more variable at older ages than at younger ages, because of the much smaller absolute number of deaths at older ages. Since life insurance and pension products may be more relevant to mortality patterns at old ages, age-coherent approaches are to be explored using mortality rates on the original scale. For instance, in their seminal work, Brouhns et al. (Reference Brouhns, Denuit and Vermunt2002) incorporate the LC methodology into the modeling of Poisson-distributed death counts via maximum likelihood estimation. Thus, the LC-H and LC-G approaches may be extended to this framework to obtain age-coherent forecasts that exploit the additional information of the exposure to risk. VAR-type models can be developed based on AR models with Poisson response variables (see, for example, Brandt and Williams, Reference Brandt and Williams2001). Such explorations remain for future work.
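To make the Poisson extension concrete, under the model of Brouhns et al. (Reference Brouhns, Denuit and Vermunt2002) the death counts satisfy $D_{x,t} \sim \text{Poisson}(E_{x,t}\,e^{a_x + b_x k_t})$, and maximum likelihood estimation optimizes the log-likelihood below. This is a sketch only; the function name and array layout are our own:

```python
import numpy as np

def poisson_lc_loglik(D, E, a, b, k):
    """Poisson Lee-Carter log-likelihood (Brouhns et al., 2002):
    D[x, t] ~ Poisson(E[x, t] * exp(a[x] + b[x] * k[t])),
    where D are death counts and E the exposures to risk.
    The additive constant -log(D!) is omitted, as it does not
    affect maximum likelihood estimation."""
    eta = a[:, None] + np.outer(b, k)   # log central death rates, ages x years
    mu = E * np.exp(eta)                # expected death counts
    return float(np.sum(D * np.log(mu) - mu))
```

An age-coherent variant along the lines of LC-H or LC-G would then impose the hyperbolic or geometric decay structure on $b_x$ while maximizing this likelihood, rather than fitting the log rates by least squares.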
Acknowledgment
The authors would like to thank the Australian National University and Macquarie University for the research support. We particularly thank the Editor (Montserrat Guillen) and two anonymous referees for providing valuable and insightful comments on earlier drafts. The usual disclaimer applies.
Supplementary material
To view supplementary material for this article, please visit https://doi.org/10.1017/asb.2022.23