Support Vector Machines as tools for mortality graduation

A topic of interest in demographic and biostatistical analysis, as well as in actuarial practice, is the graduation of the age-specific mortality pattern. A classical graduation technique is to fit parametric models. Recently, particular emphasis has been given to graduation using nonparametric techniques. Support Vector Machines (SVM) is an innovative methodology that could be utilized for mortality graduation purposes. This paper evaluates SVM techniques as tools for graduating mortality rates. We apply SVM to empirical death rates from a variety of populations and time periods. For comparison, we also apply standard graduation techniques to the same data.


Introduction
For nearly two centuries, demographers, biostatisticians, actuaries, and social workers have shown great interest in the means of representing the age-specific mortality pattern of a population. Demographers want to describe and project the mortality pattern of a population for the purpose of mortality analysis, as well as to provide population projections. Biostatisticians need a basis for making mortality forecasts. Actuaries need a mortality basis suitable for calculations in life insurance and in designing social security systems. Social planning also requires estimates and projections of age-specific mortality.
In order to estimate the unknown age-specific probabilities of dying that underlie the empirical measures, we can use graduation techniques applied to empirical death rates, under the assumption that the true probabilities follow a smooth pattern through age. For the purpose of graduation, several parametric and non-parametric techniques have been proposed. Parametric functions of age, commonly known in demography as mortality laws, have been in use for more than a century. The earliest attempt to provide such a formula was by de Moivre in 1725, while the most widely known law of mortality was proposed by Gompertz in 1825. Keyfitz (1982) provides a review of these historical laws. In modern times, many authors have contributed to the theory of parametric models of mortality, and to the problem of estimating their parameters (e.g., Heligman and Pollard 1980; Keyfitz 1982; Forfar et al. 1988; Kostaki 1992; Hannerz 1999; Karlis and Kostaki 2000).
Recently, the utilization of non-parametric smoothing techniques for graduation purposes has gained attention. Among these techniques, special attention is given to kernels (Cobas and Haberman 1983). An evaluation of kernels as tools for graduating the mortality pattern is provided by Kostaki and Peristera (2005). Support Vector Machines (SVM) is a modern non-parametric graduation methodology that appeared in the mid-nineties in the framework of Vapnik's Statistical Learning Theory (Vapnik 1995; Moguerza and Muñoz 2006). Since SVM techniques have shown very successful results in smoothing noisy data, such as neighbourhood curves (Muñoz and Moguerza 2005) or nonlinear profiles (Moguerza et al. 2007), they can probably serve as an equally useful tool for mortality graduation purposes. Regarding demographic data, SVM have shown interesting performance when applied to the graduation of age-specific fertility patterns (Kostaki et al. 2009). These techniques are easy to adjust, which implies they can be easily applied by demographers who may lack a thorough background in Statistical Learning Theory or pattern recognition.
This work provides an evaluation of the SVM methodology in the context of mortality graduation. Section 2 provides a summary description of proven graduation techniques, i.e., kernels and parametric models. Section 3 is devoted to a presentation of the SVM methodology. Then, in Section 4, an evaluation is provided of the utilization of SVM methodology for the graduation of age-specific death rates. Namely, we apply SVM to empirical death rates for several populations and time periods. Additionally, for comparison purposes, kernels are also applied, and the Heligman-Pollard model (Heligman and Pollard 1980) is fitted to the same datasets. Finally, in Section 5 some concluding remarks are provided.

Laws of mortality
Parametric modeling is widely used in demography for graduation purposes since it provides results with the highest degree of smoothness. Detailed presentations of the features of parametric models are given by Keyfitz (1982), Kostaki (1992), Congdon (1993), and Karlis and Kostaki (2000). A huge variety of mortality laws has been presented in the literature since 1725. Among them, the most successful attempt to describe the mortality pattern for the total life span through a parametric model might be the one proposed by Heligman and Pollard (1980). This model is described by the formula
$$\frac{q_x}{p_x} = A^{(x+B)^C} + D \exp\left[-E\,(\ln x - \ln F)^2\right] + G H^x,$$
where $q_x$ is the probability of dying within a year at age $x$, $p_x = 1 - q_x$, and A to H are parameters to be estimated. It includes eight parameters, all of them having a demographic interpretation. The first additive term of the right-hand side of the formula describes mortality at the childhood ages. It includes three parameters: A, which reflects the level of childhood mortality; C, related to the rate of mortality decrease in childhood ages; and B, which is indicative of the mortality level at age zero. The middle term reflects accident mortality and it also includes three parameters: D, related to the severity of the accident hump; E, related to its spread; and F, indicating the location of the hump. Finally, the third term includes two parameters: G, reflecting the level of later adult mortality; and H, related to the rate of mortality increase at the later adult ages. Heligman and Pollard (1980) estimated these parameters using a least-squares approach, in order to minimize the sum of squares
$$S^2 = \sum_x \left(\frac{\hat{q}_x}{q_x} - 1\right)^2,$$
where $\hat{q}_x$ is the fitted value at age $x$ and $q_x$ is the observed mortality rate.
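As an illustration of the eight-parameter formula and of the least-squares criterion described above, the following minimal Python sketch evaluates the Heligman-Pollard odds, converts them to a probability of dying, and computes the relative sum of squares. The function names and the parameter values are hypothetical and chosen only to give a plausible curve; the accident-hump term is set to zero at age 0, which is its limit as x tends to 0. Fitting the parameters would additionally require a numerical optimizer, which is not shown here.

```python
import math

def hp_odds(x, A, B, C, D, E, F, G, H):
    """Heligman-Pollard odds q_x / p_x at age x (x >= 0)."""
    child = A ** ((x + B) ** C)                      # childhood mortality
    # Accident hump: ln(x) is undefined at x = 0, where the term tends to 0.
    hump = D * math.exp(-E * (math.log(x) - math.log(F)) ** 2) if x > 0 else 0.0
    senescent = G * H ** x                           # later adult mortality
    return child + hump + senescent

def hp_qx(x, *params):
    """Probability of dying within a year, recovered from the odds."""
    odds = hp_odds(x, *params)
    return odds / (1.0 + odds)

def hp_loss(observed, params):
    """Sum over ages of (fitted / observed - 1)^2; observed maps age -> q_x."""
    return sum((hp_qx(x, *params) / q - 1.0) ** 2 for x, q in observed.items())

# Hypothetical parameter values (A, B, C, D, E, F, G, H), for illustration only.
example_params = (0.0005, 0.01, 0.10, 0.001, 10.0, 20.0, 0.00005, 1.10)
example_rates = {x: hp_qx(x, *example_params) for x in range(0, 83)}
print(hp_loss(example_rates, example_params))
```

Because the example rates are generated from the same parameters, the loss printed here is zero; with empirical rates the loss would be minimized over the eight parameters.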

Kernel techniques
Consider a set of observations of two variables X and Y, i.e., data of the form $(x_i, y_i)$, $i = 1,\dots,n$, which are related via an unknown regression function m as follows:
$$y_i = m(x_i) + \varepsilon_i,$$
where the $\varepsilon_i$ are independent random variables with zero mean and constant variance.
The problem now consists in estimating the unknown function m. In order to estimate m at a point x, the values of the response variable are locally averaged. The width of the neighbourhood over which averaging is performed, called the bandwidth, controls the smoothness of the resulting estimator. Hence, an estimator of the function m of the following type is used:
$$\hat{m}(x) = \frac{1}{n} \sum_{i=1}^{n} W_{hi}(x)\, Y_i,$$
where $W_{hi}$ is a weight function depending on the bandwidth parameter h and the set of variables $X_1,\dots,X_n$.
A conceptually simple approach to representing the weight function W h is to describe its shape by a density function called the kernel function, with a scale parameter h, i.e., the bandwidth, which adjusts the size and the form of the weights near x. Therefore, kernel regression estimators are local weighted averages of the response variable, whose weights are determined by the kernel function K, while the size of the weights depends on the bandwidth parameter h.
Generally, the kernel function K has the fundamental properties of a probability density. In the regression context, the kernel function is generally a smooth, positive function, which peaks at zero and decreases monotonically as the distance from zero increases.
Several formulae have been proposed for the kernel estimator $\hat{m}$ of the regression mean function m, depending on the type of the kernel regression estimator used. An extensive presentation of these formulae is provided in Kostaki and Peristera (2005). Among the alternative estimators, Kostaki and Peristera (2005) have shown that the Gasser-Müller estimator (Gasser and Muller 1979; 1984) is the most adequate in the context of mortality graduation.
At a point x, the Gasser-Müller estimator is given by the formula
$$\hat{m}(x) = \sum_{i=1}^{n} \left[ \int_{s_{i-1}}^{s_i} \frac{1}{h} K\!\left(\frac{x-u}{h}\right) du \right] Y_{[i]},$$
where $s_0 = -\infty$, $s_n = +\infty$, $s_i = (x_{(i)} + x_{(i+1)})/2$ for $i = 1,\dots,n-1$, $x_{(i)}$ denotes the ith ordered value of the observed covariate values, and $Y_{[i]}$ is the corresponding response value.
Appropriate selection of the bandwidth parameter is of great importance, since it controls the degree of smoothness and consequently influences the resulting estimator. A presentation of bandwidth selection techniques can be found in Hardle (1990; 1991) and Kostaki and Peristera (2005). One approach to selecting the bandwidth parameter is to construct a direct plug-in estimator of the optimal smoothing parameter $h_{opt}$. Gasser et al. (1991) give expressions for the $h_{opt}$ appropriate to the Gasser-Müller estimator, and describe how the unknown quantities can be effectively estimated. An important issue in bandwidth selection is the choice between a global and a local bandwidth. Local bandwidth selection yields a bandwidth that adapts to local conditions in different parts of the design, which means that a smaller bandwidth is used in areas of high density while the value of the bandwidth increases in areas of low density. Brockmann et al. (1993) and Hermann (1997) have mentioned the advantage of using kernel regression estimators with a local bandwidth instead of a global one. The main idea of the plug-in method is to estimate the optimal bandwidths by estimating the asymptotically optimal mean-integrated squared-error bandwidths. For the selection of a local bandwidth, Hermann (1997) developed an iterative plug-in algorithm that is a generalization of the global iterative plug-in algorithm of Gasser et al. (1991). A description of this algorithm can be found in Hermann (1997), where the advantage of this approach over the cross-validation method and the global plug-in rule is highlighted.
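To make the Gasser-Müller construction more concrete, the sketch below computes the estimator with a global bandwidth in plain Python. It rests on assumptions not stated in the text: a Gaussian kernel is used so that each integral of the kernel reduces to a difference of normal distribution functions, the interval boundaries are taken as midpoints between consecutive ordered ages (with infinite outer boundaries), and the ages and rates are invented for illustration. It is not the plug-in routine of the R library mentioned later.

```python
import math

def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gasser_muller(x, xs, ys, h):
    """Gasser-Mueller estimate at x with a Gaussian kernel and bandwidth h."""
    pairs = sorted(zip(xs, ys))                  # order by covariate value
    n = len(pairs)
    # Boundaries: -inf, midpoints between consecutive ordered x values, +inf.
    s = [-math.inf]
    s += [(pairs[i][0] + pairs[i + 1][0]) / 2.0 for i in range(n - 1)]
    s += [math.inf]
    estimate = 0.0
    for i, (_, y) in enumerate(pairs):
        # Integral of the scaled kernel over (s[i], s[i+1]) as a CDF difference.
        upper = 1.0 if s[i] == -math.inf else norm_cdf((x - s[i]) / h)
        lower = 0.0 if s[i + 1] == math.inf else norm_cdf((x - s[i + 1]) / h)
        estimate += (upper - lower) * y
    return estimate

# Hypothetical ages and death rates, for illustration only.
ages = list(range(10))
rates = [0.010, 0.002, 0.001, 0.0008, 0.0007,
         0.0006, 0.0006, 0.0007, 0.0009, 0.0012]
smoothed = [gasser_muller(a, ages, rates, h=2.0) for a in ages]
print(smoothed)
```

The weights are non-negative and sum to one, so each smoothed value is a weighted average of the observed rates; a larger h spreads the weights over more ages and gives a smoother curve.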

Support Vector Machines
Support Vector Machines (SVMs) appeared in the mid-nineties in the framework of Vapnik's Statistical Learning Theory (Vapnik 1995; Moguerza and Muñoz 2006), providing very successful results for the smoothing of noisy data such as neighbourhood curves (Muñoz and Moguerza 2005) or nonlinear profiles (Moguerza et al. 2007). Support Vector Machines belong to the family of regularization methods, which also includes splines (Moguerza and Muñoz 2006). In fact, there is a close relation between the two methodologies, SVM and splines (Pearce and Wand 2006). Next we provide a description of the regression version of SVM and its main features.

Support Vector Machines for regression
Before presenting the geometrical interpretation of SVM for regression, we note that, from a practical point of view, regression SVM can be formulated as a convex quadratic optimization problem (therefore, without local minima) of the form
$$\min_{w,\,b,\,\xi,\,\xi'} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{p} (\xi_i + \xi'_i)$$
$$\text{subject to} \quad y_i - \left(w^T \Phi(x_i) + b\right) \le \varepsilon + \xi_i, \quad \left(w^T \Phi(x_i) + b\right) - y_i \le \varepsilon + \xi'_i, \quad \xi_i,\ \xi'_i \ge 0, \quad i = 1,\dots,p,$$
where $(x_i, y_i)$, $i = 1,\dots,p$, are a set of data with $x_i \in R^n$ and $y_i \in R$, $C$ is a positive constant that penalizes the slacks, and $\xi_i$ and $\xi'_i$ are slack variables which permit the violation of a boundary determined by $\varepsilon$. $\Phi: R^n \to R^m$ is a mapping defining the kernel function $K: X \times X \to R$ (for instance, the space X may be defined as $R^n$), such that $K(x,y) = \Phi(x)^T \Phi(y)$. In this way, geometrically, $\Phi$ maps the data from the so-called "input space" (that is, $R^n$) into the "feature space" (that is, $R^m$).
One of the key issues of SVM is how to use $\Phi(x)$ to map the data into a higher-dimensional space. To achieve this task, a kernel approach is used in order to operate in the "feature space" without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the "feature space." The three most widely used kernels are: the linear kernel $K(x,y) = x^T y$, which corresponds to the identity mapping; the polynomial kernel $K(x,y) = (c + x^T y)^d$, where c and d are constants, which maps the data into a finite-dimensional space; and the Gaussian kernel $K(x,y) = \exp\left(-\|x-y\|^2/\sigma\right)$, where $\sigma$ is a positive constant, which maps the data into an infinite-dimensional space. The role of the kernel is crucial within the SVM methodology. Depending on the kernel used, the approximation capacity of the methodology will be different. In this way, the linear kernel (the simplest one) will be useful for the approximation of linear functions, while the Gaussian kernel will be suitable for the approximation of nonlinear functions. Given its approximation capacity, the Gaussian kernel is the most extensively used.
It can be shown (see Moguerza and Muñoz 2006) that the resulting regression function is
$$f^*(x) = w^{*T} \Phi(x) + b^*,$$
where $w^*$ and $b^*$ are the values of w and b at the solution of the quadratic optimization problem.
In practice, the optimization problem to solve is not the primal formulation shown above. For practical purposes, the problem to solve is the "dual problem" (Schölkopf et al. 2000), that is:
$$\max_{\lambda,\,\lambda'} \; -\frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} (\lambda_i - \lambda'_i)(\lambda_j - \lambda'_j) K(x_i, x_j) - \varepsilon \sum_{i=1}^{p} (\lambda_i + \lambda'_i) + \sum_{i=1}^{p} y_i (\lambda_i - \lambda'_i)$$
$$\text{subject to} \quad \sum_{i=1}^{p} (\lambda_i - \lambda'_i) = 0, \quad 0 \le \lambda_i,\ \lambda'_i \le C, \quad i = 1,\dots,p.$$
It can be shown that both problems, primal and dual, are equivalent, and that
$$f^*(x) = \sum_{i=1}^{p} \alpha_i K(x_i, x) + b^*,$$
where $\alpha_i = \lambda_i^* - \lambda_i'^*$, with $\lambda_i^*$ and $\lambda_i'^*$ being the values of $\lambda_i$ and $\lambda'_i$ at the solution of the dual problem. Therefore, in practice, the estimated parameters are the $\alpha$ coefficients, whose number is p, that is, the number of data. In this way, the relationship between kernels and SVM is clear: only the closed form of the kernel K is needed, and not the explicit mapping $\Phi$. Notice that this distinctive peculiarity allows, for instance, the use of the Gaussian kernel in order to evaluate $f^*(x)$. Moreover, in practice, only a small percentage of the $\alpha$ coefficients will differ from zero, which simplifies the evaluation of this function (this is one of the advantages of SVM; see Moguerza and Muñoz 2006), and reduces the number of estimated parameters.
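The role of the α coefficients and of the kernel can be illustrated with a short Python sketch that evaluates a fitted regression function from its non-zero coefficients, together with the ε-insensitive loss that defines the band around the curve. Several things here are assumptions made for illustration: the Gaussian kernel is written as exp(−(x − y)²/σ) for one-dimensional inputs, and the support points, coefficients, and intercept are invented rather than obtained by solving the dual problem.

```python
import math

def gaussian_kernel(x, y, sigma):
    """Gaussian kernel for scalar inputs (assumed form exp(-(x - y)^2 / sigma))."""
    return math.exp(-((x - y) ** 2) / sigma)

def eps_insensitive(residual, eps):
    """Loss that ignores residuals inside the band of half-width eps."""
    return max(0.0, abs(residual) - eps)

def svr_predict(x, support_x, alphas, b, sigma):
    """Evaluate f(x) = sum_i alpha_i K(x_i, x) + b, skipping zero coefficients."""
    total = sum(a * gaussian_kernel(xi, x, sigma)
                for xi, a in zip(support_x, alphas) if a != 0.0)
    return total + b

# Hypothetical coefficients, for illustration only (not a solved dual problem).
support_ages = [10.0, 20.0, 30.0]
coefficients = [0.5, 0.0, -0.5]
print(svr_predict(20.0, support_ages, coefficients, b=0.1, sigma=125.0))
print(eps_insensitive(0.05, eps=0.02))
```

Only points with a non-zero coefficient contribute to the sum, which is why a sparse solution makes the fitted function cheap to evaluate.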

Piecewise Support Vector Machine (PSVM)
The standard SVM described above can be specialized in order to treat functions whose derivatives take large values within some intervals of the range of support values, and small values within other intervals of the range of support values. With this aim we define the Piecewise Support Vector Machine (PSVM) method.
The key point of this method is to train an SVM for each predefined interval, and then calculate the breakpoints between intervals as a function of the piecewise smoothers. In the case of mortality data, two intervals of the same length have been considered in order to divide age x. The first interval corresponds to the subset of the curve domain with stationary points, that is, points where the first derivative equals zero. The second interval corresponds to the subset of the curve domain where the function has an increasing behaviour, that is, where the first derivative is positive.

Evaluation and comparisons
Our calculations are based on the empirical age-specific mortality rates of the male and female populations of Sweden, for the periods 1981-5, 1984-8, and 1991-5, as well as France and Japan for the years 1990, 1991, and 1995. The Swedish datasets are taken from Statistics Sweden, while the French and Japanese ones are part of the Berkeley Mortality Database, available on the web at http://www.demog.berkeley.edu/wilmoth/mortality.
For kernel applications, the subroutine "glkerns" of the library "glkern" for R is used for the calculation of Gasser-Müller estimators with a bandwidth parameter. This is available at http://www.unizh.ch/biostat/software. In order to select the bandwidth for a Gaussian kernel regression estimator, trials were made in R using a direct plug-in technique (Ruppert et al. 1995), in particular the one implemented in the KernSmooth library. However, this methodology has been discarded given the overfitting it produced. Therefore, the bandwidth parameter has been computed by cross-validation, leading to a value of 2.3849 for all the estimated curves. In this way, we have a unique model for all the datasets.
The parameters in the Heligman-Pollard model are estimated using an iterative routine of the NAG library that is based upon a modification of the Gauss-Newton algorithm, described by Gill and Murray (1978).
For the SVM applications, the subroutine "svm" of the library "e1071" for R is used to derive the SVM and the PSVM model parameters. This is available at http://cran.r-project.org/. A two-step simulation procedure is used to select the parameters ε, σ, and C of the ε-regression procedure: ε is used to fix the width of a band around the fitted curve, σ plays the role of a variance, and C is an upper bound for the λ coefficients in the dual optimization problem and, at the same time, penalizes the values of the slacks corresponding to those points lying outside of the band determined by ε in the primal optimization problem. As a first step, the ranges of parameters ε, σ, and C are determined. Then, in the second step, the best combination of the three parameters is computed using cross-validation techniques. In particular, the values ε = 0.02, σ = 125, and C = 2,200 were obtained for the SVM implementation. For the PSVM implementation, the values ε = 0.11, σ = 111.1, and C = 3,900 were obtained for the first interval, and the values ε = 0.008, σ = 175.4, and C = 50 were obtained for the second interval, while the solution for the breakpoint $x_b$ was calculated as an average function of $f_1^*$ and $f_2^*$. It can be observed that the parameters for the SVM implementation are approximately an average of the parameters obtained for each interval of the PSVM implementation. The parameters change so drastically between the two intervals because the structure of the curve is significantly different within each interval. In this way, with the PSVM we are able to capture the local structure of the curves better.
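The second step of the selection procedure, choosing the best parameter combination by cross-validation, can be sketched generically in Python as follows. This is only an illustration under stated assumptions: the fold layout (interleaved folds), the squared-error score, the grid values, and the data are all hypothetical, and a simple Gaussian kernel-weighted average stands in for the SVM fit so that the example stays self-contained. With an actual SVM trainer, the grid would range over ε, σ, and C.

```python
import itertools
import math

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(f, n, k)) for f in range(k)]

def grid_search_cv(xs, ys, grid, fit_predict, k=5):
    """Return (best mean squared CV error, best parameter dict) over the grid."""
    names = sorted(grid)
    folds = k_fold_indices(len(xs), k)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        sq_err = 0.0
        for test_idx in folds:
            held_out = set(test_idx)
            train_x = [x for i, x in enumerate(xs) if i not in held_out]
            train_y = [y for i, y in enumerate(ys) if i not in held_out]
            for i in test_idx:
                pred = fit_predict(train_x, train_y, xs[i], **params)
                sq_err += (pred - ys[i]) ** 2
        score = sq_err / len(xs)
        if best is None or score < best[0]:
            best = (score, params)
    return best

def kernel_average(train_x, train_y, x, sigma):
    """Stand-in smoother (NOT an SVM): Gaussian kernel-weighted average."""
    weights = [math.exp(-((xi - x) ** 2) / sigma) for xi in train_x]
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)

# Hypothetical data and grid, for illustration only.
ages = list(range(83))
log_rates = [math.log(0.0001 * 1.09 ** x) + 0.05 * (-1) ** x for x in ages]
sigma_grid = {"sigma": [1.0, 10.0, 100.0, 1000.0]}
best_score, best_params = grid_search_cv(ages, log_rates, sigma_grid, kernel_average)
print(best_score, best_params)
```

The first step of the procedure corresponds to choosing the lists in the grid; the search itself only compares the cross-validated errors of the listed combinations.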
In this application, the values for the corresponding dimensions in the SVM model are n = 1, m = ∞ (given that this is the dimension induced by the Gaussian kernel; see Moguerza and Muñoz 2006), and p = 83, that is, the number of data within each set. We should note here again that the same set of parameter values is used for all the datasets. In this way, we are able to make fair comparisons of these results with those produced by kernels.
A mortality graduation can be considered successful if the graduated rates progress smoothly from age to age, and at the same time accurately reflect the underlying mortality pattern while avoiding systematic deviations and random variations. It is in this sense that we evaluate the effectiveness of the different approaches adopted for the graduation of our mortality datasets.
Although graphical representation of the observed and the graduated rates is a useful way to derive conclusions, we also use statistical criteria in order to evaluate the performance of the alternative estimators. For that, we use a chi-square criterion to check the closeness of the graduated rates to the observed ones. Then in order to evaluate smoothness of the results we calculate the sum of the absolute values of the third differences for each graduation.
The chi-square criterion, used for evaluating adherence of the results to the observed rates, is defined as
$$\chi^2 = \sum_x \frac{E_x \left(\hat{q}_x - q_x\right)^2}{q_x (1 - q_x)},$$
where $E_x$ is the exposed-to-risk population at age x, $q_x$ is the observed death rate at age x, $\hat{q}_x$ is the graduated one, and $E_x / [q_x (1 - q_x)]$ are the reciprocals of the variances of the observed $q_x$.
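A minimal Python sketch of this adherence criterion is given below. It assumes the weights are exactly the reciprocals of the variances of the observed rates, as described above; the function name and the numbers are hypothetical.

```python
def chi_square(exposure, observed, graduated):
    """Weighted sum of squared deviations between graduated and observed rates.

    Each term is weighted by E_x / [q_x (1 - q_x)], the reciprocal of the
    variance of the observed rate q_x.
    """
    return sum(E * (g - q) ** 2 / (q * (1.0 - q))
               for E, q, g in zip(exposure, observed, graduated))

# Hypothetical values for three ages, for illustration only.
exposed = [1000.0, 1200.0, 900.0]
observed_q = [0.10, 0.05, 0.02]
graduated_q = [0.11, 0.05, 0.02]
print(chi_square(exposed, observed_q, graduated_q))
```

Lower values indicate graduated rates that stay closer to the observed ones, with deviations at ages with a large exposure counting more heavily.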
Finally, in order to check the smoothness of the resulting probabilities, we examine the third-order differences of the graduated values. We therefore calculate the sum of the absolute values of the third differences in each graduated set of values, i.e., the quantity
$$\sum_x \left| \Delta^3 \hat{q}_x \right|,$$
multiplied by 100,000 in order to have an easier interpretation of the results.
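The smoothness criterion can be sketched in a few lines of Python. The third difference is taken here as the forward difference of order three, which is an assumption about the convention used; the function names and the example sequence are illustrative only.

```python
def third_differences(values):
    """Forward third differences: v[i+3] - 3 v[i+2] + 3 v[i+1] - v[i]."""
    return [values[i + 3] - 3 * values[i + 2] + 3 * values[i + 1] - values[i]
            for i in range(len(values) - 3)]

def smoothness(values, scale=100_000):
    """Sum of absolute third differences, multiplied by the scale factor."""
    return scale * sum(abs(d) for d in third_differences(values))

# A quadratic sequence has zero third differences, so it counts as fully smooth.
quadratic = [0.001 * x ** 2 for x in range(6)]
print(smoothness(quadratic))
```

Smaller values of this quantity correspond to smoother graduated curves, since third differences vanish for any locally quadratic progression of rates.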
The values of the two criteria for all the datasets used, and all graduation techniques used, are presented in Tables A1-A3 (Appendix A). Table A4 presents average results for the overall data. Examining these values, one can easily observe that the SVM graduation proves adequate in terms of goodness of fit, as well as in terms of smoothness. Considering the values of the χ² quantity for the Swedish and the Japanese datasets, these are in almost all cases lower for the SVM than for the Heligman-Pollard model (HP8) and kernels. For the French datasets, the results for the two SVM techniques, and especially those for the PSVM one, are clearly superior to those obtained for the other two techniques. Considering the overall values of the χ² criterion presented in Table A4, we conclude that both SVM techniques prove superior to the other two methodologies.
Considering smoothness, the values of the sum of third-order differences were lower, in almost all cases and overall, for the two alternative SVM techniques than for the other two methods.
Comparing the values of both criteria for SVM and PSVM, we conclude that PSVM proves superior to SVM in terms of goodness of fit. However, in terms of smoothness, SVM in many cases provides somewhat better results than PSVM. Figures B1-B6 (Appendix B) illustrate the results for some chosen cases. As these illustrations clearly show, SVM and PSVM perform successfully, especially in the most difficult parts of the age interval, i.e., the early adult ages. Figures B7-B18 illustrate the results of each technique separately for some chosen cases. It is clear in these figures that the results of the SVM techniques are closer to the empirical data than those of the Heligman-Pollard formula, the latter exhibiting some systematic deviations in the early adult ages. It is also clear that SVM techniques provide better results than kernels regarding both goodness of fit and smoothness.

Remarks
In this paper we proposed the application of Support Vector Machines techniques as tools for graduating age-specific mortality patterns. For evaluation purposes we applied SVM methodology to empirical datasets of a variety of populations and time periods. In addition, for comparison we also applied kernels and fitted the Heligman-Pollard formula to the same datasets. The results of our calculations indicate that SVM techniques prove to be adequate and, in most cases, superior to the other two graduation techniques, providing results that are closer to the empirical values when compared to the Heligman-Pollard model and kernels, and smoother than those provided by kernels. An advantage of non-parametric graduation techniques compared to parametric modeling is that these are more flexible and can adequately be applied to all datasets. Meanwhile, in datasets with distorted patterns the use of standard models is inadequate; more complicated formulae are required in such cases. Furthermore, regulation of the degree of smoothness by the user can also be considered an advantage, allowing the user to choose the optimal degree of smoothness, depending on the purpose of graduation at hand, and also avoiding oversimplification of age patterns. Regarding future extensions of this work, SVM can easily be used as a multivariate model, providing a promising area for further research on demographic problems.