
Robust and Efficient Mediation Analysis via Huber Loss

Published online by Cambridge University Press:  13 January 2025

WenWu Wang*
Affiliation:
School of Statistics and Data Science, Qufu Normal University, Qufu, China
Xiujin Peng
Affiliation:
School of Statistics and Data Science, Qufu Normal University, Qufu, China
Tiejun Tong
Affiliation:
Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong
*
Corresponding author: WenWu Wang; Email: wangwenwu@qfnu.edu.cn

Abstract

Mediation analysis is one of the most widely used methods in the social sciences and related areas. To estimate the indirect effect, least-squares regression is routinely applied, and it is also the most efficient when the errors are normally distributed. In practice, however, real data sets are often non-normal, either heavy-tailed or skewed, so that the least-squares estimators may behave very badly. To overcome this problem, we propose a robust M-estimation of the indirect effect via a general loss function, with a main focus on the Huber loss, which grows more slowly at large values than the squared loss. We further propose a data-driven procedure to select the optimal tuning constant by minimizing the asymptotic variance of the Huber estimator, which is more robust than the least-squares estimator in the presence of outliers and non-normal data, and more efficient than the least-absolute-deviation estimator. Simulation studies compare the finite-sample performance of the Huber loss with existing competitors in terms of the mean square error, the type I error rate, and the statistical power. Finally, the usefulness of the proposed method is illustrated using two real data examples.

Type
Theory and Methods
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-ShareAlike licence (http://creativecommons.org/licenses/by-sa/4.0), which permits re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is used to distribute the re-used or adapted article and the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of Psychometric Society

1 Introduction

In the social sciences and related areas, the effect of an exposure on an outcome variable is often mediated by an intermediate variable. Mediation analysis aims to identify the direct effect of the predictor on the outcome and the indirect effect that the predictor transmits to the outcome through a change in the mediator (MacKinnon, 2008). Since the seminal paper of Baron and Kenny (1986), mediation analysis has become one of the most popular statistical methods in the social sciences. Empirical applications of mediation analysis have expanded dramatically in sociology, psychology, epidemiology, and medicine (Lockhart et al., 2011; Newland et al., 2013; Ogden et al., 2010; Richiardi et al., 2013; Rucker et al., 2011). In practice, however, researchers have found that the assumptions of traditional mediation analysis methods, e.g., normality and no outliers, often do not match the data they collect, which may lead to misleading results (Preacher, 2015; Yuan & MacKinnon, 2014). To overcome this problem, it is often necessary to adopt more sophisticated models for mediation analysis (Frölich & Huber, 2017; Lachowicz et al., 2018; VanderWeele & Tchetgen, 2017). For more details on mediation analysis, one may refer to recent books including, for example, MacKinnon (2008), VanderWeele (2015), and Hayes (2023).

One important issue in mediation analysis is inference on the indirect effect, with a main focus on testing its statistical significance. In this direction, the first approach is the causal steps approach (Baron & Kenny, 1986), which specifies a series of tests of links in a causal chain. Some variants of this method that test three different hypotheses have also been proposed (Allison, 1995; Kenny et al., 1998). The second approach is the difference in coefficients approach (Freedman & Schatzkin, 1992), which takes the difference between a regression coefficient before and after adjustment for the intervening variable. The third approach is the product of coefficients approach, which multiplies the coefficients of paths in a path model (MacKinnon et al., 1998, 2004; Sobel, 1982). MacKinnon et al. (2002) compared 14 methods of testing the statistical significance of the indirect effect and found that the difference in coefficients approach and the product of coefficients approach have better control of the type I error rate as well as higher power in most cases. Between the two, the product of coefficients method is more widely used, mainly thanks to its clear causal path interpretation (MacKinnon et al., 2004; Preacher & Hayes, 2008; Preacher & Selig, 2012; Yuan & MacKinnon, 2014).

To estimate the indirect effect, least-squares (LS) regression is routinely applied, and it is also the most efficient when the errors are normally distributed. In practice, however, real data sets are often non-normal, either heavy-tailed or skewed (Field & Wilcox, 2017). As an example, Micceri (1989) examined $440$ data sets from the psychological and educational literature and found that none of them were normally distributed at the $\alpha =0.01$ significance level. When applied to non-normal data sets, the LS estimators may behave very badly (Huber & Ronchetti, 2009). To circumvent such drawbacks, some robust approaches have recently emerged in the mediation literature. Zu and Yuan (2010) adopted the local influence function to identify strongly influential outliers. Yuan and MacKinnon (2014) proposed least-absolute-deviation (LAD) regression for heavy-tailed errors, and Wang and Yu (2023) established the statistical theory for the LAD estimation of the indirect effect. Lastly, as noted by Preacher (2015), mediation analysis for non-normal variables has become an active research field.

To move forward, it is noteworthy that the LS and LAD estimators are special cases of M-estimators, which minimize a specified loss function (Hansen, 2022; Serfling, 2001). Another popular loss function in M-regression is the Huber loss, which is quadratic for small residuals and linear for large ones, with a tuning parameter controlling the transition between the two regimes (Huber, 1964). This tuning parameter governs the trade-off between efficiency and robustness. Wang et al. (2007) found that the Huber loss with the optimal tuning parameter can greatly improve efficiency while maintaining robustness. To the best of our knowledge, little work has been done on estimating the indirect effect from the perspective of the optimal loss.

This article further advances the literature by developing robust estimation of the indirect effect. To be specific, our approach mainly alleviates the effects of outliers in the response variable and implicitly assumes that there are no high-leverage points in the independent variables. In Section 2, we introduce M-regression in the simple mediation model with a general loss function. An iteratively reweighted least-squares algorithm is also proposed to numerically solve the M-regression, as well as to construct two robust confidence intervals. In Section 3, we propose a data-driven approach to select the optimal tuning constant, and study the statistical properties of the Huber loss in particular. In Section 4, we conduct simulation studies to assess the finite-sample performance of the Huber loss and compare it with the existing competitors used in mediation analysis. We further illustrate the advantages of our method with an empirical example in Section 5, and conclude the article in Section 6 with some discussion.

2 Simple mediation model

The simplest mediation model is given in Figure 1, where X is the independent variable, Y is the dependent variable, and M is the mediating variable that mediates the effects of X on $Y.$ Given the observations $(X_{i}, M_{i}, Y_{i})$ for $i=1, \ldots , n$ , this simple mediation model consists of three linear regression equations as

(1) $$ \begin{align} Y_{i}&=\beta_{1}+cX_{i}+\epsilon_{1, i}, \end{align} $$
(2) $$ \begin{align} M_{i}&=\beta_{2}+aX_{i}+\epsilon_{2, i}, \end{align} $$
(3) $$ \begin{align} Y_{i}&=\beta_{3}+c^{\prime}X_{i}+bM_{i}+\epsilon_{3, i}, \end{align} $$

where c represents the total effect of X on Y, a represents the relation between X and M, $c^{\prime }$ represents the direct effect of $X $ on Y after adjusting the effect of M, b represents the relation between M and Y after adjusting the effect of X, and the random errors $\epsilon _{j, i}, j=1, 2, 3$ , are independent of the corresponding regressors.

Figure 1 Causal diagram of the simple mediation model.

2.1 M-regression

To alleviate the effects of influential observations in the least-squares fitting, we adopt the M-regression to estimate the regression parameters, which can be regarded as a generalization of the maximum likelihood estimation as follows:

(4) $$ \begin{align} (\hat{\beta}_{1}, \hat{c})^{T}&=\arg\min_{\beta_{1}, c}\sum_{i=1}^{n}\rho(Y_{i}-\beta_{1}-cX_{i}), \end{align} $$
(5) $$ \begin{align} (\hat{\beta}_{2}, \hat{a})^{T}&=\arg\min_{\beta_{2}, a}\sum_{i=1}^{n}\rho(M_{i}-\beta_{2}-aX_{i}), \end{align} $$
(6) $$ \begin{align} (\hat{\beta}_{3}, \hat{c^{\prime}}, \hat{b})^{T}&=\arg\min_{\beta_{3}, c^{\prime},b} \sum_{i=1}^{n}\rho(Y_{i}-\beta_{3}-c^{\prime}X_{i}-bM_{i}), \end{align} $$

where $\rho (\cdot )$ is the loss function with three properties: (i) non-negativity, $\rho (\epsilon )\geq 0$ with $\rho (0)=0$ ; (ii) symmetry, $\rho (\epsilon )=\rho (-\epsilon )$ ; and (iii) monotonicity, $\rho (\epsilon )\geq \rho (\epsilon ^{\prime })$ for any $|\epsilon |\geq |\epsilon ^{\prime }|$ .

Let $\psi (\epsilon ) = (d/d\epsilon )\rho (\epsilon )$ be the first derivative of the loss function, referred to as the influence curve. Let also $X=(X_{1}, \ldots , X_{n})^{T}$ , $M=(M_{1}, \ldots , M_{n})^{T}$ , $Y=(Y_{1}, \ldots , Y_{n})^{T}$ , $I=(1, \ldots , 1)^{T}$ , $\tilde {\boldsymbol {X}}=(I, X)$ , and $\check {\boldsymbol {X}}=(I, X, M)$ . For large samples, we further assume that $\boldsymbol {U}$ is the limiting matrix of $(n^{-1}\tilde {\boldsymbol {X}}^{T}\tilde {\boldsymbol {X}})^{-1}$ , and $\boldsymbol {V}$ is the limiting matrix of $(n^{-1}\check {\boldsymbol {X}}^{T}\check {\boldsymbol {X}})^{-1}$ . Then by Huber and Ronchetti (2009), we have the following asymptotic normality for the M-estimators of the regression parameters.

Lemma 1. For the mediation model linked with (1)–(3), under the regularity conditions given on pages 163–164 of Huber and Ronchetti (2009), the M-estimators in (4)–(6) are all asymptotically normally distributed:

$$ \begin{align*}\sqrt{n}(\hat{c}-c)&\sim N\left(0, \frac{\mathrm{E}_{\epsilon_{1}}[\psi^{2}]}{(\mathrm{E}_{\epsilon_{1}}[\psi^{\prime}])^{2}} \boldsymbol{U}_{[2, 2]}\right), & \sqrt{n}(\hat{a}-a)&\sim N\left(0, \frac{\mathrm{E}_{\epsilon_{2}}[\psi^{2}]}{(\mathrm{E}_{\epsilon_{2}}[\psi^{\prime}])^{2}} \boldsymbol{U}_{[2, 2]}\right), \\ \sqrt{n}(\hat{c^{\prime}}-c^{\prime})&\sim N\left(0,\frac{\mathrm{E}_{\epsilon_{3}}[\psi^{2}]} {(\mathrm{E}_{\epsilon_{3}}[\psi^{\prime}])^{2}}\boldsymbol{V}_{[2, 2]}\right), & \sqrt{n}(\hat{b}-b)&\sim N\left(0, \frac{\mathrm{E}_{\epsilon_{3}}[\psi^{2}]} {(\mathrm{E}_{\epsilon_{3}}[\psi^{\prime}])^{2}}\boldsymbol{V}_{[3, 3]}\right). \end{align*} $$

Finally, based on the M-estimators in (4)–(6), we can define two new estimators of the indirect effect: one is the difference estimator $\hat {c}-\hat {c^{\prime }}$ and the other is the product estimator $\hat {a}\hat {b}$ .

2.2 Solution to M-regression

For a general loss $\rho (\cdot )$ , the M-estimator may not have an explicit expression, so a numerical solution is often required. To present our algorithm, we focus only on (4), since the same algorithm extends to (5) and (6) as well. Differentiating the objective function $\sum _{i=1}^{n}\rho (Y_{i}-\beta _{1}-cX_{i})$ with respect to $\beta _{1}$ and c and setting the partial derivatives to zero yields a system of two estimating equations:

$$ \begin{align*} \sum_{i=1}^{n}\psi(Y_{i}-\beta_{1}-cX_{i})&=0, \\ \sum_{i=1}^{n}\psi(Y_{i}-\beta_{1}-cX_{i})X_{i}&=0. \end{align*} $$

Further by introducing the weight function $w(e)=\psi (e)/e$ , the estimating equations can be rewritten as

$$ \begin{align*} \sum_{i=1}^{n}w_{i}\times(Y_{i}-\beta_{1}-cX_{i})&=0, \\ \sum_{i=1}^{n}w_{i}\times(Y_{i}-\beta_{1}-cX_{i})X_{i}&=0, \end{align*} $$

where $w_{i}=w(Y_{i}-\beta _{1}-cX_{i})$ . Solving these two equations is equivalent to minimizing

$$ \begin{align*} \sum_{i=1}^{n}w_{i}\times(Y_{i}-\beta_{1}-cX_{i})^2, \end{align*} $$

which is a weighted LS problem. Because the weights depend on the regression coefficients, and the regression coefficients in turn depend on the weights, an iteratively reweighted least-squares (IRLS) algorithm is appropriate for obtaining the numerical solution (Holland & Welsch, 1977). To handle the multiple-minima problem, in case it arises, we choose several different points in the parameter space as initial estimates, so as to gain higher confidence of reaching the true global minimum (Green, 1984). More specifically, the IRLS algorithm for our problem is as follows.
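As a concrete illustration of the scheme, the following is a minimal sketch of one IRLS fit for model (1) with Huber weights (written in Python for illustration; the function names, zero starting values, and the fixed tuning constant $k=1.345$ are our own choices, not the paper's implementation):

```python
def huber_weight(e, k):
    # w(e) = psi(e)/e: 1 in the quadratic region, k/|e| in the linear region
    return 1.0 if abs(e) <= k else k / abs(e)

def irls_huber(x, y, k=1.345, n_iter=100, tol=1e-8):
    """Fit Y = beta1 + c*X by iteratively reweighted least squares:
    compute Huber weights from the current residuals, solve the
    weighted LS problem, and repeat until the coefficients converge."""
    n = len(x)
    beta1, c = 0.0, 0.0
    for _ in range(n_iter):
        w = [huber_weight(y[i] - beta1 - c * x[i], k) for i in range(n)]
        sw = sum(w)
        xbar = sum(w[i] * x[i] for i in range(n)) / sw
        ybar = sum(w[i] * y[i] for i in range(n)) / sw
        sxx = sum(w[i] * (x[i] - xbar) ** 2 for i in range(n))
        sxy = sum(w[i] * (x[i] - xbar) * (y[i] - ybar) for i in range(n))
        c_new = sxy / sxx
        beta1_new = ybar - c_new * xbar
        done = max(abs(beta1_new - beta1), abs(c_new - c)) < tol
        beta1, c = beta1_new, c_new
        if done:
            break
    return beta1, c

# one gross outlier in Y barely moves the Huber fit of y = 1 + 2x
x = list(range(10))
y = [1 + 2 * xi for xi in x]
y[9] += 50
beta1_hat, c_hat = irls_huber(x, y)
```

For these toy data, the ordinary LS slope is pulled to roughly 4.7 by the single outlier, whereas the Huber fit stays close to the true slope 2, because the outlier's weight shrinks to $k/|e|$.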

2.3 Error conditions for model consistency

When is the product of parameters $ab$ equal to the difference in parameters $c-c^{\prime }$ in the population? This is an important question in mediation analysis since it uncovers the relationship between the indirect, direct, and total effects (Wang et al., 2023; Wang & Yu, 2023; Yuan & MacKinnon, 2014).

Note that the three regression equations, (1)-(3), are interrelated in the simple mediation model. By substituting (2) into (3), we have

(7) $$ \begin{align} Y_{i}&=\beta_{3}+c^{\prime}X_{i}+b(\beta_{2}+aX_{i}+\epsilon_{2, i})+\epsilon_{3, i} \notag \\ &=(\beta_{3}+b\beta_{2})+(c^{\prime}+ab)X_{i}+\epsilon_{i}, \end{align} $$

where $\epsilon _{i}=b\epsilon _{2, i}+\epsilon _{3, i}$ . If $\epsilon _{2, i}$ and $\epsilon _{3, i}$ are independent and symmetrically distributed with median $0$ , then $\epsilon _{i}$ is also symmetric with $\mathrm {Med}[\epsilon _i]=0$ (see Proposition 1 in Wang & Yu, 2023). In addition, let $\epsilon _{1,i}$ also be symmetrically distributed with $\mathrm {Med}[\epsilon _{1,i}]=0$ . Then by (1) and (7),

$$ \begin{align*} \mathrm{Med}[Y_{i}|X_{i}]&=\beta_{1}+cX_{i}+\mathrm{Med}[\epsilon_{1, i}|X_{i}], \\ \mathrm{Med}[Y_{i}|X_{i}]&=(\beta_{3}+b\beta_{2})+(c^{\prime}+ab)X_{i}+\mathrm{Med}[\epsilon_{i}|X_{i}]. \end{align*} $$

Noting also that the random errors are independent of the corresponding regressors as assumed in Section 2.1, we have $\mathrm {Med}[\epsilon _{1, i}|X_i]=\mathrm {Med}[\epsilon _{1, i}]=0$ and $\mathrm {Med}[\epsilon _{i}|X_{i}]=\mathrm {Med}[\epsilon _{i}]=0$ , and moreover,

$$ \begin{align*} \beta_{1}+cX_{i}\equiv (\beta_{3}+b\beta_{2})+(c^{\prime}+ab)X_{i}, \quad i=1,\ldots,n, \end{align*} $$

which further yields that $\beta _{1}=\beta _{3}+b\beta _{2}$ and $c=c^{\prime }+ab$ . Finally, by comparing (1) and (7), we also have $\epsilon _{i}=\epsilon _{1, i}$ . For convenience, we summarize the above result in Theorem 1.

Theorem 1. In the simple mediation model, given the independence of the errors and the corresponding regressors, assume further that the errors $\epsilon _{j, i}$ , $j=1, 2, 3$ , are independent and symmetrically distributed with a unique median $0$ . Then $ab=c-c^{\prime }$ , which establishes an equality between the indirect, direct, and total effects.

Remark 1. Many error distributions satisfy the assumption in Theorem 1. For instance, when $\epsilon _{2, i}$ and $\epsilon _{3, i}$ are independent and normally distributed, Yuan and MacKinnon (2014) discussed the model consistency. Wang and Yu (2023) further discussed the consistency conditions for the LAD loss and obtained a similar equality to that in Theorem 1.

2.4 Inference based on confidence interval

There are two estimators of the indirect effect: $\hat {c}-\hat {c^{\prime }}$ and $\hat {a}\hat {b}$ . Unlike the two LS estimators, which are equivalent (MacKinnon et al., 1995; Wang et al., 2023), the two M-estimators of the indirect effect for a general loss do not coincide in general, that is, $\hat {a}\hat {b}\neq \hat {c}-\hat {c^{\prime }}$ . Simulation studies show that the product estimator is often more efficient than the difference estimator (see Appendix A0). Interestingly, the same conclusion also holds when the LAD loss is applied (Wang & Yu, 2023). In view of this, we consider the null hypothesis $H_{0}: ab=0$ . To test whether $ab=0$ , there are two common methods in the literature: the parametric method (Sobel, 1982) and the nonparametric resampling method (MacKinnon et al., 2004; Preacher & Selig, 2012).

Our first method is to perform a robust Sobel test. Given the robust estimates $\hat {a}$ and $\hat {b}$ , we define the robust test statistic as

$$ \begin{align*} Z=\frac{\hat{a}\hat{b}}{\widehat{\mathrm{SE}}_{Sobel}}, \end{align*} $$

where $\widehat {\mathrm {SE}}_{Sobel}=\sqrt {\hat {a}^{2}\times \widehat {\mathrm {SE}}_{b}^{2}+\hat {b}^{2}\times \widehat {\mathrm {SE}}_{a}^{2}}$ , and $\widehat {\mathrm {SE}}_{a}$ and $\widehat {\mathrm {SE}}_{b}$ are the standard errors (SEs) of $\hat {a}$ and $\hat {b}$ , respectively. Following Lemma 1, the two SEs can be estimated by

$$ \begin{align*} \widehat{\mathrm{SE}}_{a}&=\left(\frac{n^{-1}\sum_{i=1}^{n}\psi^{2}(M_{i}-\hat{\beta}_{2}-\hat{a}X_{i}) [(\tilde{\boldsymbol{X}}^{T}\tilde{\boldsymbol{X}})^{-1}]_{[2, 2]}} {[n^{-1}\sum_{i=1}^{n}\psi^{\prime}(M_{i}-\hat{\beta}_{2}-\hat{a}X_{i})]^{2}}\right)^{1/2}, \\ \widehat{\mathrm{SE}}_{b}&=\left(\frac{n^{-1}\sum_{i=1}^{n}\psi^{2}(Y_{i}-\hat{\beta}_{3}-\hat{c^{\prime}}X_{i}-\hat{b}M_{i}) [(\check{\boldsymbol{X}}^{T}\check{\boldsymbol{X}})^{-1}]_{[3, 3]}} {[n^{-1}\sum_{i=1}^{n}\psi^{\prime}(Y_{i}-\hat{\beta}_{3}-\hat{c^{\prime}}X_{i}-\hat{b}M_{i})]^{2}}\right)^{1/2}. \end{align*} $$

Moreover, the normal-based $100(1-\alpha )\%$ CI of $ab$ can be constructed as

$$ \begin{align*} [\hat{a}\hat{b}-z_{1-\alpha/2}\widehat{\mathrm{SE}}_{Sobel}, \: \hat{a}\hat{b}+z_{1-\alpha/2}\widehat{\mathrm{SE}}_{Sobel}], \end{align*} $$

where $\alpha $ is the significance level and $z_{1-\alpha /2}$ is the $(1-\alpha /2)$ quantile of the standard normal distribution. Note, however, that when a and b are small, the sampling distribution of $\hat {a}\hat {b}$ may not be normal (MacKinnon et al., 2004; Wang et al., 2023). Thus, to obtain an accurate CI, critical values of the distribution of $\hat {a}\hat {b}$ can be obtained by Monte Carlo simulation (Meeker et al., 1981; Meeker & Escobar, 1994). In fact, one can easily obtain these critical values by inputting $\hat {a}$ , $\hat {b}$ , $\widehat {\mathrm {SE}}_{a}$ , and $\widehat {\mathrm {SE}}_{b}$ into the R procedure medci() introduced by Tofighi and MacKinnon (2011).
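To make the two constructions concrete, here is a minimal sketch of the normal-based Sobel CI and a Monte Carlo CI for $\hat{a}\hat{b}$ (in Python for illustration; the function names are ours, and the Monte Carlo version simply simulates the product of two normals rather than using the refined critical values of medci()):

```python
import math
import random
from statistics import NormalDist

def sobel_ci(a_hat, b_hat, se_a, se_b, alpha=0.05):
    """Normal-based CI a*b +/- z * SE_Sobel."""
    se = math.sqrt(a_hat**2 * se_b**2 + b_hat**2 * se_a**2)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return a_hat * b_hat - z * se, a_hat * b_hat + z * se

def mc_ci(a_hat, b_hat, se_a, se_b, alpha=0.05, n_rep=100_000, seed=1):
    """Monte Carlo CI: simulate a and b from their normal sampling
    distributions and take empirical quantiles of the simulated products,
    which respects the non-normal shape of the distribution of a*b."""
    rng = random.Random(seed)
    prods = sorted(rng.gauss(a_hat, se_a) * rng.gauss(b_hat, se_b)
                   for _ in range(n_rep))
    return (prods[int(alpha / 2 * n_rep)],
            prods[int((1 - alpha / 2) * n_rep) - 1])
```

The Monte Carlo interval is asymmetric around $\hat{a}\hat{b}$ when the product distribution is skewed, whereas the Sobel interval is symmetric by construction.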

Our second method for constructing a CI is the bootstrap, based on resampling. The bootstrap method is nonparametric and robust in the sense that it does not need to estimate the SEs. First, we repeatedly resample the original dataset with replacement (Efron & Tibshirani, 1993); second, we estimate the indirect effect for each bootstrap sample using our proposed Huber method; third, we construct the CI by the percentile bootstrap (PRCT) as $[q_{\alpha /2}, q_{1-\alpha /2}]$ , where $q_{\alpha /2}$ is the $\alpha /2$ quantile of the empirical distribution of the indirect effect. To adjust for and remove potential estimation bias, the bias-corrected and accelerated bootstrap (BCa) is an important variant (Efron, 1987; Efron & Tibshirani, 1993). In general, the BCa method can yield a more accurate CI than the PRCT method when the true parameter value is not the median of the distribution of the bootstrap estimates (MacKinnon et al., 2004).
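The three bootstrap steps can be sketched generically as follows (in Python for illustration; the estimator argument stands in for the Huber product estimator $\hat{a}\hat{b}$, and the sample mean is used here only as a toy statistic):

```python
import random

def percentile_bootstrap_ci(data, estimator, alpha=0.05, n_boot=2000, seed=7):
    """Percentile bootstrap CI: resample the rows of the data with
    replacement, re-estimate the statistic on each bootstrap sample, and
    take the alpha/2 and 1-alpha/2 empirical quantiles (PRCT).  BCa would
    instead adjust these two quantile levels for bias and acceleration
    before reading off the endpoints."""
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(
        estimator([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    return (stats[int(alpha / 2 * n_boot)],
            stats[int((1 - alpha / 2) * n_boot) - 1])

# toy usage with the sample mean as the statistic
data = list(range(1, 101))
lo, hi = percentile_bootstrap_ci(data, lambda s: sum(s) / len(s))
```

For the indirect effect, each resampled dataset would consist of the triples $(X_i, M_i, Y_i)$, with the Huber fits of (5) and (6) recomputed on every bootstrap sample.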

3 Robust and efficient estimation via Huber loss

From a likelihood perspective, the best loss function would be the negative log-likelihood function (Schrader & Hettmansperger, 1980). Nevertheless, since the likelihood function is often unknown, one needs to specify an appropriate loss function in real applications. In this section, we study robust and efficient estimation using the Huber loss with the optimal choice of tuning parameter. Note that our methodology is general and can also be extended to other loss functions.

3.1 Huber loss

The Huber loss, as defined in Huber (1964), is given by

$$ \begin{align*} \rho_{H}(e)&=\left\{\begin{array}{ll} \frac{1}{2}e^{2}, & \text{if}\quad |e|\leq k, \\ k|e|-\frac{1}{2}k^{2}, &\text{if}\quad |e|>k, \end{array}\right. \\ \psi_{H}(e)&=\left\{\begin{array}{ll} e, & \text{if}\quad |e|\leq k, \\ k\times \text{sgn}(e), & \text{if}\quad |e|>k, \end{array}\right. \end{align*} $$

where $k>0$ is the tuning parameter. A smaller value of k produces more resistance to outliers, at the expense of lower efficiency when the error is normal. For instance, letting $k=1.345\sigma $ with $\sigma $ the standard deviation of the error yields a $95\%$ efficiency for normal errors, while remaining resistant to outliers with a breakdown point of 5.8%. Moreover, the standard deviation $\sigma $ can be estimated robustly by the median absolute deviation (MAD) as

$$ \begin{align*} \hat{\sigma}_{M\!A\!D}=\mathrm{Med}\{|e_{i}|\}/0.6745. \end{align*} $$
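In code, the loss, its influence curve, and the MAD scale estimate are one-liners (a Python sketch with our own function names; the paper's computations use R):

```python
from statistics import median

def huber_rho(e, k):
    """Huber loss: quadratic inside [-k, k], linear outside."""
    return 0.5 * e * e if abs(e) <= k else k * abs(e) - 0.5 * k * k

def huber_psi(e, k):
    """Influence curve: identity inside [-k, k], clipped at +/-k outside."""
    return max(-k, min(k, e))

def sigma_mad(residuals):
    """Robust scale estimate Med{|e_i|}/0.6745, consistent for the
    standard deviation under normal errors."""
    return median(abs(e) for e in residuals) / 0.6745
```

Note that the two branches of $\rho_H$ match in value and in derivative at $|e|=k$, so $\psi_H$ is continuous, which is what bounds the influence of any single large residual.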

For any error $\epsilon $ , we denote by $\tau =\sigma _{\psi }^{2}/B_{\psi }^{2}$ the asymptotic variance of the Huber estimator (Huber, 1964), where $\sigma _{\psi }^{2}=\mathrm {E}[\psi ^{2}(\epsilon )]$ and $B_{\psi }=\mathrm {E}[\psi ^{\prime }(\epsilon )]$ . We then minimize the value of $\tau $ to determine the optimal $\rho (\cdot )$ . For the Huber loss with a given k, we have

$$ \begin{align*} B_{\psi}(k)&=\int_{-k}^{k}\mathrm{d}F(\epsilon), \\ \sigma_{\psi}^{2}(k)&=\int_{-k}^{k}\epsilon^{2}\mathrm{d}F(\epsilon)+k^{2}(1-B_{\psi}(k)), \end{align*} $$

where $F(\cdot )$ is the cumulative distribution function of $\epsilon $ .

Remark 2. As $k\rightarrow \infty $ , the Huber loss becomes the LS loss, so that $\tau _{L\!S}=\sigma ^{2}$ , where $\sigma ^{2}$ is the variance of the error distribution. As $k\rightarrow 0$ , the Huber loss becomes the LAD loss, so that $\tau _{L\!A\!D}=1/(4f(0)^{2})$ , where $f(0)$ is the density of the error distribution at $0$ . Based on the observed data, the optimal tuning constant can be selected to achieve the smallest estimation variance. From this viewpoint, the Huber estimator is more efficient than its competitors when dealing with unknown and complex error distributions.

3.2 Optimal tuning constant

As is well known, the tuning parameter k of the Huber loss can have a great impact on the estimation efficiency. When the error is normally distributed without contamination, the best choice of k is $\infty $ . On the other hand, when the error follows a heavy-tailed distribution such as the t distribution, the optimal k tends to be small, close to 0.

We adopt the numerical method proposed by Wang et al. (2007) to select the optimal tuning constant, which minimizes the asymptotic variance of the estimator. For the Huber loss, the optimal k minimizes the efficiency factor $\tau $ via a three-step procedure. First, we compute $\tau (k)$ over a grid of k values, i.e., $0<k\leq K$ in steps of $0.001$ , where K is a positive number, e.g., $K=4$ . Second, we select the optimal k as

$$ \begin{align*} k_{opt}=\arg\min_{0<k\leq K}\tau(k). \end{align*} $$

Lastly, we compute the minimum value $\tau (k_{opt})$ . In Appendix B, we provide an R procedure to obtain the optimal tuning constant for a known error distribution.
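A sketch of this computation for the contaminated normal error $0.9N(0,1)+0.1N(0,3^{2})$ from Table 1 (in Python rather than R; the closed-form second moment of a truncated normal replaces numerical integration, and the function name is ours):

```python
import math

def tau_huber_mixnormal(k, p=0.9, s=3.0):
    """tau(k) = sigma_psi^2(k) / B_psi^2(k) for the Huber loss when the
    error follows the contaminated normal p*N(0,1) + (1-p)*N(0, s^2)."""
    def Phi(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    def phi(x):
        return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    def comp(sig):
        # (P(|e| <= k), E[e^2 1{|e| <= k}]) for a N(0, sig^2) component
        z = k / sig
        mass = 2.0 * Phi(z) - 1.0
        second = sig * sig * (mass - 2.0 * z * phi(z))
        return mass, second
    m1, s1 = comp(1.0)
    m2, s2 = comp(s)
    B = p * m1 + (1 - p) * m2
    sig2 = p * s1 + (1 - p) * s2 + k * k * (1.0 - B)
    return sig2 / (B * B)

# step 1-2: grid search on (0, 4] with step 0.001, as in the procedure
ks = [i / 1000 for i in range(1, 4001)]
k_opt = min(ks, key=tau_huber_mixnormal)
```

The grid search reproduces the values in the caption of Figure 2 (left panel): $k_{opt}\approx 1.489$ with $\tau(k_{opt})\approx 1.296$, well below both $\tau_{LS}=1.8$ and the LAD value.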

For ease of reference, we list the optimal $k_{opt}$ and $\tau (k_{opt})$ in Table 1 for several error distributions, including the standard normal distribution $N(0, 1)$ , the Laplace distribution Laplace $(0, 1)$ , the mixed normal distribution $0.9N(0, 1)+0.1N(0, \sigma ^{2})$ with $\sigma =3$ or $10$ , and the t distribution with $1$ or $2$ degrees of freedom. In general, the Huber loss with the optimal tuning parameter k is more efficient than the LS and LAD losses, since the smaller $\tau $ is, the more efficient the loss is. Moreover, to intuitively reflect how $\tau (k)$ varies with k, we plot the $\tau (k)$ function for the mixed normal and $t_{1}$ distributions in Figure 2. It is evident that the value of $\tau (k)$ varies dramatically with k.

Figure 2 $\tau (k)$ is plotted for $0.9N(0, 1)+0.1N(0, 3^{2})$ (left) and $t_{1}$ (right). The corresponding red lines are $\tau (1.489)=1.296$ and $\tau (0.395)=2.278$ , respectively.

Table 1 Optimal k and $\tau (k)$ for various error distributions and loss functions

3.3 Nonparametric selection of the tuning constant

Following (4) and letting $e_{i}=Y_{i}-\hat {\beta }_{1}-\hat {c}X_{i}$ be the residuals, we propose to estimate $\tau $ nonparametrically by

(8) $$ \begin{align} \hat{\tau}(k)=\frac{\hat{\sigma}_{\psi}^{2}(k)}{\hat{B}_{\psi}^{2}(k)}, \end{align} $$

where $\hat {\sigma }_{\psi }^{2}(k)=n^{-1}\sum _{i=1}^{n}\psi ^{2}(e_{i})$ and $\hat {B}_{\psi }(k)=n^{-1}\sum _{i=1}^{n}\psi ^{\prime }(e_{i})$ . More specifically for the Huber loss, we have

$$ \begin{align*} \hat{B}_{\psi}(k)&=\frac{1}{n}\sum_{i=1}^{n}\mathrm{{I}}(|e_{i}|\leq k),\\ \hat{\sigma}_{\psi}^{2}(k)&=\frac{1}{n}\sum_{i=1}^{n}\left\{e_{i}^{2} \mathrm{{I}}(|e_{i}|\leq k)+k^{2}\mathrm{{I}}(|e_{i}|>k)\right\}, \end{align*} $$

where $\mathrm {{I}}$ is the $0-1$ indicator function.

We propose a data-driven procedure that determines the optimal $\hat {k}$ by minimizing $\hat {\tau }(k)$ , which is similar in spirit to Wang et al. (2007) for a linear regression model with a scale parameter $\sigma $ . The new procedure is summarized in Algorithm 2.

Note that in the algorithm we have specified the maximum allowable k as $3\hat {\sigma }_{M\!A\!D}$ , which is usually sufficient since the probability that the errors fall within the interval $[-3\hat {\sigma }_{M\!A\!D}, 3\hat {\sigma }_{M\!A\!D}]$ is as large as 99.73% for normal errors. To further investigate the performance of the proposed method for selecting the tuning constant k, we conduct a simulation study and report the results in Table B of the Appendix. When the sample size is large, the selected tuning constant k is very close to the theoretical one. Moreover, the standard deviation of the selected tuning constant decreases dramatically as the sample size increases. These findings coincide with the conclusions in Wang et al. (2007).
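The core selection step, i.e., forming $\hat{\tau}(k)$ from the residuals and minimizing it over a grid capped at $3\hat{\sigma}_{M\!A\!D}$, can be sketched as follows (Python for illustration; the function names, grid step, and the small positive lower bound on the grid are our own choices, and the deterministic normal-quantile "residuals" serve only to exercise the code):

```python
from statistics import NormalDist, median

def tau_hat(residuals, k):
    """Empirical efficiency factor tau-hat(k) from Equation (8), Huber loss."""
    n = len(residuals)
    B = sum(1 for e in residuals if abs(e) <= k) / n
    if B == 0:
        return float("inf")
    s2 = sum(e * e if abs(e) <= k else k * k for e in residuals) / n
    return s2 / (B * B)

def select_k(residuals, step=0.01):
    """Data-driven tuning constant: minimize tau-hat(k) on a grid capped
    at 3*sigma_MAD.  The small lower bound on the grid is our own
    safeguard, since the empirical criterion is unstable when almost no
    residuals fall inside [-k, k]."""
    s_mad = median(abs(e) for e in residuals) / 0.6745
    lo = int(0.05 * s_mad / step) + 1
    hi = int(3.0 * s_mad / step)
    return min((i * step for i in range(lo, hi + 1)),
               key=lambda k: tau_hat(residuals, k))

# deterministic pseudo-residuals: near-normal residuals favour a large k,
# while heavy contamination drives the selected k down
nd = NormalDist()
clean = [nd.inv_cdf((i + 0.5) / 200) for i in range(200)]
contam = clean[10:190] + [10.0] * 10 + [-10.0] * 10
k_clean, k_contam = select_k(clean), select_k(contam)
```

This mirrors the qualitative behaviour described above: with (near-)normal residuals the criterion pushes k toward the cap, while outliers pull the selected k toward the robust end of the grid.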

4 Simulation studies

Two simulation studies are carried out to evaluate the performance of the proposed method. Simulation A compares the efficiency of the three estimators based on the LS, LAD, and Huber losses under various designs, and Simulation B evaluates their type I error rates and power. For the simulation settings, we follow Yuan and MacKinnon (2014) and Wang and Yu (2023) and set $\beta _2=\beta _{3}=0$ , $c^{\prime }=1$ , and $a=b=0.14, 0.39, 0.59$ . Moreover, the sample size is set at $n=50, 200, 1000$ , corresponding to small, medium, and large samples, and four error distributions are considered: $N(0, 1)$ , Laplace $(0, 1)$ , $0.9N(0, 1)+0.1N(0, 10^{2})$ , and $t_{2}$ .

For each simulated dataset, we estimate the regression parameters based on the LS, LAD, and Huber losses, and apply the product $\hat a \hat b$ to estimate the indirect effect. Then with 1,000 simulations for each setting, we compute the mean square error (MSE) to assess the estimation accuracy as follows:

$$ \begin{align*} \mathrm{MSE}[\hat{a}\hat{b}]=\frac{1}{1000}\sum_{i=1}^{1000}(\hat{a}_{i}\hat{b}_{i}-ab)^{2}, \end{align*} $$

where $\hat {a}_{i}\hat {b}_{i}$ denotes the product estimate from the ith replication.

Moreover, we use the type I error rate and the statistical power to assess the performance of the LS, LAD, and Huber estimators for testing $H_0: ab=0$ . We use the robust Sobel test (Sobel Z), the percentile bootstrap (PRCT), and the BCa method to construct the CIs. The type I error rate is the probability of incorrectly rejecting the null hypothesis when it is true, whereas the statistical power is the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true. A good testing procedure should control the type I error rate while maximizing the power as much as possible. In practice, the empirical type I error rate (or power) is calculated as the proportion of CIs that do not contain zero when the indirect effect does not exist (or exists).

4.1 Efficiency of the LS, LAD, and Huber estimators

The MSE ( $\times 10^3$ ) and standard deviation (SD $\times 10^3$ ) of the LS, LAD, and Huber estimators are presented in Table 2 for various designs. Comparing the MSEs of the three estimators, we have two main findings. First, the MSE and SD of all three estimators decrease as the sample size increases. Second, the MSE and SD of the Huber estimator are always the smallest or close to the smallest. When the error follows $N(0,1)$ (or Laplace $(0,1)$ ), the LS (or LAD) estimator is optimal, and in these two cases the Huber estimator performs very close to the optimal estimator. For $0.9N(0,1)+0.1N(0,10^2)$ and $t_2$ , the MSE of the Huber estimator is the smallest among the three. To conclude, the Huber estimator is more efficient than the LAD estimator when the error distribution is normal, and more robust than the LS estimator when the error distribution is non-normal.

Table 2 MSE ( $\times 10^3$ ) and SD ( $\times 10^3$ labeled below MSE) for the LS, LAD, and Huber estimators

Note: The bold font indicates the smallest MSE among the three estimators under each set of experimental conditions.

4.2 Type I error rate and power

We now apply the Sobel Z, PRCT, and BCa methods to construct the 95% CIs. Note that the medium effect sizes ( $a=b=0.39$ ) already yield a high power even when the sample size is moderate ( $n=200$ ). Thus, to save space, we omit the simulations for the large effect size.
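The PRCT method resamples cases with replacement and takes quantiles of the resampled product estimates. A minimal sketch, in Python with LS fits standing in for the paper's robust fits:

```python
import numpy as np

def prct_boot_ci(x, m, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the indirect effect a*b."""
    rng = np.random.default_rng(seed)
    n = len(x)
    ab = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)          # resample cases with replacement
        xb, mb, yb = x[idx], m[idx], y[idx]
        a_hat = np.linalg.lstsq(np.column_stack([np.ones(n), xb]),
                                mb, rcond=None)[0][1]
        b_hat = np.linalg.lstsq(np.column_stack([np.ones(n), xb, mb]),
                                yb, rcond=None)[0][2]
        ab[i] = a_hat * b_hat
    lo, hi = np.quantile(ab, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# illustrative data with a = b = 0.5 (values are ours, not the paper's design)
rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)
y = 0.3 * x + 0.5 * m + rng.normal(size=n)
lo, hi = prct_boot_ci(x, m, y)
```

The empirical power of the test is then the proportion of such intervals that exclude zero across replications.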

Table 3 reports the type I error rates of the three estimators under various designs. When the sample size is large, i.e., $n=1000$ , the type I error rates of the LS, LAD, and Huber estimators are all controlled in most cases. One exception is the LS estimator with the CIs constructed by the BCa method, which was also observed by Fritz et al. (2012) with the explanation that the increased type I error rate is a function of an interaction between the nonzero effect size and the sample size. Another notable situation is that the type I error rate of the Huber loss Sobel test is slightly too high for the mixed normal and $t_2$ distributions under the small and moderate sample sizes. Possible reasons are that the standard error used for the Sobel test, $\widehat {\mathrm {SE}}_{Sobel}$ , is affected by the optimizer’s curse (Smith & Winkler, 2006), and/or that there is a gap between the optimal tuning constant and the one determined by Algorithm 2 for small sample sizes. In Appendix E, we conduct another simulation study to assess these effects on the standard error used for the Sobel test. The results indicate that $\widehat {\mathrm {SE}}_{Sobel}$ is indeed influenced by the optimizer’s curse, although this effect diminishes as the sample size increases. At the same time, the Huber estimator with the fixed $k=1.345$ performs better than the Huber estimator with the selected tuning constant (Huber-SEL) for small sample sizes. Observing this, when the Huber-SEL estimator fails to yield satisfactory results, we suggest taking a moderate tuning constant, i.e., $k=1.345$ , as an alternative.
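The Sobel Z statistic itself is a one-liner once $\hat a$, $\hat b$, and their standard errors are available from the LS, LAD, or Huber fits; the sketch below assumes the standard first-order delta-method form of $\widehat{\mathrm{SE}}_{Sobel}$:

```python
import math

def sobel_z(a_hat, se_a, b_hat, se_b):
    """Sobel test statistic for H0: ab = 0, using the first-order
    delta-method standard error sqrt(b^2 se_a^2 + a^2 se_b^2)."""
    se_sobel = math.sqrt(b_hat**2 * se_a**2 + a_hat**2 * se_b**2)
    return a_hat * b_hat / se_sobel
```

With $|z|$ compared against the normal quantile 1.96, this gives the 5%-level Sobel test used alongside the bootstrap CIs.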

Table 3 Type I error rates (%) of the LS, LAD and Huber estimators for various designs

Note: The bold font indicates an excessive type I error rate exceeding 6.8%; with 1,000 independent simulation runs, the empirical type I error rate of a test with level 0.05 is expected to lie in the interval $[2.3\%,6.8\%]$ with probability 0.99 by the normal approximation.

Following the same designs, we report the power of the three estimators in Table 4. For the normal errors, the LS estimator not only controls the type I error rate but also achieves the highest power among the three estimators. For the non-normal errors, however, the LS estimator is notably lacking in statistical power, especially for the mixed normal distribution (e.g., $a=b=0.14$ , $0.9N(0,1)+0.1N(0,10^2)$ , and $n=1000$ ). In addition, although the LAD estimator is the most robust with respect to outliers, it suffers from efficiency loss and consequently yields a lower power (e.g., $a=b=0.14$ , $N(0,1)$ , and $n=1000$ ). In contrast, the Huber estimator strikes a trade-off between efficiency and robustness: its power is close to the largest and, meanwhile, it controls the type I error rate below 5% regardless of the error distribution.

Table 4 Power (%) of the LS, LAD and Huber estimators for various designs

Note: The bold font indicates the maximal empirical power among the three estimators under each set of experimental conditions.

5 Real data analysis

In this section, we conduct two real data analyses to illustrate the usefulness of the proposed method. Both studies show that our new method can provide more efficient estimation than the existing competitors for mediation analysis. To promote practical application, we have also made the R code publicly available on GitHub at https://github.com/pxj66/REMA.git.

5.1 Pathways to desistance study

Our first study aims to uncover the causal mechanism between mental health and violent offending among serious adolescent offenders (Kim et al., 2024). In criminology, one possible mechanism is that individuals with mental health issues may be more likely to experience victimization, which, in turn, may lead to their committing a serious crime. Our data come from the Pathways to Desistance (PTD) study, which followed 1354 serious juvenile offenders at two sites, Maricopa County in Arizona (N = 654) and Philadelphia County in Pennsylvania (N = 700), over the years 2000 to 2010 (Mulvey et al., 2013). Focusing on the data of the baseline interviews, our study contains a total of 1195 respondents after data cleansing.

Consider the linear mediation model,

$$ \begin{align*} \textit{Expvic}_i &= \beta_{2} + a \textit{Health}_i + \boldsymbol{\delta}_{1}^T\boldsymbol{Z}_i + \varepsilon_{2,i},\\ \textit{Offend}_i &= \beta_{3} + c^\prime \textit{Health}_i + b \textit{Expvic}_i + \boldsymbol{\delta}_{2}^T\boldsymbol{Z}_i+\varepsilon_{3,i}, \end{align*} $$

where Health (mental health) is the independent variable, Expvic (experienced victimization) is the mediating variable, and Offend (violent offending) is the response variable. In addition, $\boldsymbol {Z}$ denotes the matrix of other control variables, including age, gender, ethnicity, family structure, parental warmth, alcohol, marijuana, gang membership, parental hostility, and unsupervised routine activities. We summarize the type and the measure of these variables in Appendix F.

To assess the normality assumption for the errors, we compute the skewness and kurtosis of the residuals of y after regressing on x and m, and of the residuals of m after regressing on x, and report them in Table 5. These values, together with the Kolmogorov–Smirnov (KS) test, clearly suggest a violation of the normality assumption. In view of this, we apply our new method to this dataset and compare it with the existing methods for mediation analysis. Table 6 reports the indirect effects and the 95% CIs constructed by the Sobel Z, PRCT, and BCa methods. From the results, we note that the three estimators produce similar and statistically significant indirect effects, whereas the Huber estimator yields the shortest CI.
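These diagnostics can be sketched as follows, here with scipy rather than the paper's R code; `residual_diagnostics` is our own illustrative helper, and the KS test against a normal with estimated mean and sd should be read as a rough screen rather than an exact test:

```python
import numpy as np
from scipy import stats

def residual_diagnostics(resid):
    """Skewness, excess kurtosis, and a KS test of the standardized
    residuals against N(0, 1). Plugging in the estimated mean and sd
    makes the KS p-value approximate, but gross violations show clearly."""
    z = (resid - resid.mean()) / resid.std(ddof=1)
    ks = stats.kstest(z, 'norm')
    return stats.skew(resid), stats.kurtosis(resid), ks.pvalue

# heavy-tailed errors such as 0.9 N(0,1) + 0.1 N(0,10^2) are flagged clearly
rng = np.random.default_rng(0)
e = np.where(rng.random(1000) < 0.1,
             rng.normal(0, 10, 1000), rng.normal(0, 1, 1000))
sk, exkurt, pval = residual_diagnostics(e)
```

Large excess kurtosis together with a tiny KS p-value, as in Table 5, is what motivates the robust fits.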

Table 5 Skewness and kurtosis of two regression residuals and the Kolmogorov–Smirnov test for the pathways to desistance study

Table 6 The indirect effect estimates and their 95% CIs based on the LS, LAD and Huber estimators for the pathways to desistance study

5.2 Action planning study

Our second study investigates the relationship between action planning and physical activity. In psychology, it is known that action planning can promote physical activity, yet the underlying mechanism between them is often unclear. To explore it, an illustrative study was recently conducted to investigate whether action planning promotes physical activity through the mediation of automaticity (Maltagliati et al., 2023), in which a total of 135 participants aged over 18 years from the tertiary industry were recruited. Participants were asked to wear an Actigraph GT3X+ accelerometer, which records their physical activity behaviors, and to note the times of these activities in a notebook for a total of seven days. More specifically, action planning is the independent variable, measured by four-item Likert scales ranging from 1 (completely disagree) to 6 (fully agree), and automaticity is the mediating variable, measured by the four-item Self-Reported Habit Index ranging from 1 (strongly disagree) to 7 (strongly agree).

Consider the linear mediation model,

(9) $$ \begin{align} \textit{Auto}_i &= \beta_{2} + a \textit{Plan}_i + \delta_1 \textit{Sex}_i + \delta_2 \textit{BMI}_i + \delta_3 \textit{Ill}_i + \varepsilon_{2,i}, \end{align} $$
(10) $$ \begin{align} \textit{PA}_i &= \beta_{3} + c^{\prime}\textit{Plan}_i + \delta_4 \textit{Sex}_i + \delta_5 \textit{BMI}_i + \delta_6 \textit{Ill}_i + b \textit{Auto}_i + \varepsilon_{3,i}, \end{align} $$

where Auto, Plan, Sex, BMI, Ill, and PA represent the automaticity, action plan of exercise, gender, body mass index, illness, and physical activity of the respondent, respectively.

To assess the normality assumption for the errors, we again compute the skewness and kurtosis of the two regression residuals, and report them in Table 7. These values, together with the KS test, suggest a serious violation of the normality assumption for the residuals of y after regressing on m and x. In view of this, we apply the proposed method to the dataset and report the results in Table 8. First of all, the three methods produce positive indirect effect estimates ranging from 0.6600 to 0.7594. For the CIs, only the LAD method yields a nonsignificant result, with its PRCT CI covering zero. At the same time, the Huber loss also yields the shortest CI among the three methods.

Table 7 Skewness and kurtosis of two regression residuals and the Kolmogorov–Smirnov test in action planning study

Table 8 The indirect effect estimates and their 95% CIs based on the LS, LAD, and Huber losses for the action planning study

6 Discussion

This article proposed a novel M-regression for mediation analysis that minimizes the Huber loss function with an optimal tuning constant. The Huber loss produces a more robust estimator than the LS loss when facing outliers and non-normal data and, on the other hand, a more efficient estimator than the LAD loss. Moreover, since the M-estimator may not have an explicit expression for a general loss function, we further proposed an IRLS algorithm for obtaining the numerical solutions. Under some mild conditions on the error distribution, the consistency of the proposed estimators for the mediation model was also established. Lastly, simulation studies and real data analyses showed that the Huber estimator performs better than the LS and LAD estimators.
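A minimal sketch of such an IRLS scheme for the Huber M-estimator is given below. This is our own illustrative implementation (Python, with MAD scale updating), not the paper's exact Algorithm; it conveys the weighted-least-squares iteration:

```python
import numpy as np

def huber_irls(X, y, k=1.345, max_iter=100, tol=1e-8):
    """Huber M-regression via iteratively reweighted least squares.
    X is the (n, p) design matrix, intercept column included."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]               # LS starting value
    for _ in range(max_iter):
        r = y - X @ beta
        scale = np.median(np.abs(r - np.median(r))) / 0.6745  # MAD scale estimate
        u = np.abs(r) / max(scale, 1e-12)
        w = np.minimum(1.0, k / np.maximum(u, 1e-12))         # Huber weights
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# illustrative fit with 10% gross-error contamination (design values are ours)
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
e = np.where(rng.random(n) < 0.1, rng.normal(0, 10, n), rng.normal(0, 1, n))
X = np.column_stack([np.ones(n), x])
beta_hat = huber_irls(X, 1.0 + 2.0 * x + e)
```

Applying this fit to each of the two mediation regressions and multiplying the relevant coefficients yields the Huber product estimator of the indirect effect.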

In the literature, two methods are commonly used to improve the estimation efficiency. The first is M-regression with an optimal loss function selected from a loss function family. Besides the Huber loss, which is among the most commonly used, other popular loss functions include, but are not limited to, the Hampel loss (Hampel et al., 1986), the generalized Gauss-weight and linear quadratic losses (Koller & Stahel, 2011), and other general losses (Barron, 2019; Tukey, 1977). When the error distribution is skewed, it is appropriate to adopt the asymmetric Huber and Tukey’s biweight losses to enhance the estimation efficiency. In the field of microeconometrics, M-regression is carried out by solving estimating equations, which can be incorporated into the generalized method of moments (GMM), as introduced in Chapter 6 of Cameron and Trivedi (2005). By imposing additional moment conditions, one can obtain more efficient estimators. The second method combines the information of quantiles to improve the estimation efficiency, e.g., the composite quantile regression (Zou & Yuan, 2008), the weighted quantile average regression (Zhao & Xiao, 2014), and the combination of difference and robust methods (Wang et al., 2019). Hence, as a further direction, it is of interest to investigate whether the estimation efficiency and power of our new method can be further improved.

Acknowledgements

We thank the editor, associate editor, and three anonymous referees for their valuable comments. This research is partially supported by the National Natural Science Foundation of China (No. 12071248) and Shandong Provincial Natural Science Foundation (No. ZR2024MA058).

Appendix A Comparing the product and difference estimators

To compare the efficiency of the product and difference estimators, we follow the same simulation design as that for Simulation A in Section 4. Table A shows the MSE ( $\times 10^3$ ) and SD ( $\times 10^3$ ) of the product and difference estimators based on the Huber loss. It is evident that the MSE and SD of the product estimator are smaller than those of the difference estimator. We hence recommend adopting the product estimator for the subsequent hypothesis testing.

Table A MSE ( $\times 10^3$ ) and SD ( $\times 10^3$ ) for the product and difference estimators based on the Huber loss

Appendix B An R procedure for selecting the tuning constant

From the likelihood perspective, the optimal loss function is LS (or LAD) when the error distribution is normal (or Laplace). Incorporating the relationship of the Huber loss with the LS and LAD losses, the optimal tuning constant is $\infty $ (or 0) for the normal (or Laplace) distribution. For other error distributions, the optimal tuning constant minimizes the asymptotic variance of the Huber estimator. More specifically, we can compute $\hat {\tau }(k)=\hat {\sigma }_\psi ^2/\hat {B}_\psi ^2$ over a sequence of k values, and the optimal k, which corresponds to the minimum value of $\hat {\tau }$ , can then be located. In what follows, we provide the R code for two examples, one for the mixed normal distribution $0.9N(0,1)+0.1N(0,3^2)$ and the other for the $t_1$ distribution.
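In outline, the grid search amounts to the following sketch (given here in Python for illustration, with function names of our own; residuals are standardized by the MAD scale before clipping):

```python
import numpy as np

def tau_hat(resid, k):
    """Empirical efficiency factor tau(k) = sigma_psi^2 / B_psi^2."""
    psi = np.clip(resid, -k, k)              # Huber psi function applied to residuals
    B = np.mean(np.abs(resid) <= k)          # estimates B_psi = E[psi'(e)]
    return np.mean(psi ** 2) / B ** 2        # estimates sigma_psi^2 / B_psi^2

def select_k(resid, grid=None):
    """Pick the grid value of k minimizing the estimated asymptotic variance."""
    if grid is None:
        grid = np.arange(0.2, 3.0 + 1e-9, 0.01)
    sigma_mad = np.median(np.abs(resid - np.median(resid))) / 0.6745
    r = resid / sigma_mad                    # standardize by the MAD scale
    taus = [tau_hat(r, k) for k in grid]
    return float(grid[int(np.argmin(taus))])

rng = np.random.default_rng(0)
k_opt = select_k(rng.standard_t(df=2, size=5000))   # heavy-tailed t_2 errors
```

For heavy-tailed errors the minimizer lies in the interior of the grid, whereas for near-normal residuals $\hat{\tau}(k)$ keeps decreasing toward the upper endpoint.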

Table B The values of Mean, SD and Median for the selected tuning constant

Appendix C Selection of the tuning constant

To evaluate the performance of Algorithm 2, we follow the same simulation design as that for Simulation A in Section 4. Table B presents the Mean, SD, and Median of the selected tuning constant. As the sample size increases, the selected k values become very close to the theoretical values. This shows that Algorithm 2 performs well for selecting the tuning constant in practical use. These findings also coincide with the conclusion in Wang et al. (2007). Note that $ k_1 $ and $ k_2 $ correspond to the tuning constants chosen from Equations (2) and (3), respectively. In practice, when the value of $ k $ is small, the value of the efficiency factor $ \tau $ is very unstable, so we let k range from 0.2 to 3 $\hat \sigma _{M\!A\!D}$ in steps of 0.01.

Appendix D Asymptotic relative efficiency of the Huber estimator

We first prove that the Huber loss with $k=1.345$ achieves 95% efficiency for normal errors. We focus on Equation (1): $Y=\beta _1+cX+\epsilon _1,$ where $\epsilon _1\sim N(0,\sigma ^2)$ . When $k=k_0$ , the efficiency factor of the Huber estimator is computed as $\tau _H=\sigma _\psi ^2/B_\psi ^2$ , where

$$ \begin{align*} B_\psi(k_0)&=\int_{-k_0}^{k_0}\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{x^2}{2\sigma^2}\right\}dx =\Phi\left(\frac{k_0}{\sigma}\right)-\Phi\left(-\frac{k_0}{\sigma}\right),\\ \sigma_\psi^2(k_0)&=\int_{-k_0}^{k_0}\frac{x^2}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{x^2}{2\sigma^2}\right\}dx+k_0^2\big[1-B_\psi(k_0)\big]\\ &=\int_{-k_0/\sigma}^{k_0/\sigma}\frac{\sigma^2x^2}{\sqrt{2\pi}}\exp\left\{-\frac{x^2}{2}\right\}dx + k_0^2\big[1-B_\psi(k_0)\big]\\ &=\sigma^2\left\{G\left(\frac{k_0}{\sigma}\right)-G\left(-\frac{k_0}{\sigma}\right)+\Phi\left(\frac{k_0}{\sigma}\right)-\Phi\left(-\frac{k_0}{\sigma}\right)\right\} + k_0^2\big[1-B_\psi(k_0)\big], \end{align*} $$

with $G(x)=-x(\sqrt {2\pi })^{-1}\exp \{-x^2/2\}$ and $\Phi (x)$ being the cumulative distribution function of the standard normal distribution. Then

$$ \begin{align*} \tau_{H}=\frac{\sigma_\psi^2(1.345\sigma)}{B_\psi^2(1.345\sigma)}=\frac{0.7101645\sigma^2}{0.6746565}=1.052631\sigma^2 \end{align*} $$

and so

$$ \begin{align*} \frac{\tau_{LS}}{\tau_{H}}=\frac{\sigma^2}{1.052631\sigma^2}=0.9500003. \end{align*} $$

This shows that the asymptotic relative efficiency of the Huber estimator relative to the LS estimator is 95% (Serfling, 2001).
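The 95% figure can be checked numerically; a short sketch with $\sigma=1$ (so $k_0=k=1.345$), evaluating $B_\psi$ and $\sigma_\psi^2$ from the closed forms above:

```python
from math import erf, exp, pi, sqrt

def pdf(x):   # standard normal density phi(x)
    return exp(-x * x / 2) / sqrt(2 * pi)

def cdf(x):   # standard normal distribution function Phi(x)
    return 0.5 * (1 + erf(x / sqrt(2)))

def efficiency(k):
    """Asymptotic relative efficiency tau_LS / tau_H at N(0,1), i.e. sigma = 1."""
    B = cdf(k) - cdf(-k)                          # B_psi(k)
    s2 = (B - 2 * k * pdf(k)) + k * k * (1 - B)   # sigma_psi^2(k)
    return (B * B) / s2

eff = efficiency(1.345)   # approximately 0.95
```

Here the truncated second moment $\int_{-k}^{k}x^{2}\phi(x)\,dx = B_\psi(k) - 2k\phi(k)$ follows from the function $G$ defined above, since $G(k)-G(-k)=-2k\phi(k)$.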

At the same time, using Equation (4.52) on page 84 in Huber and Ronchetti (2009), we have

$$ \begin{align*} \frac{2\phi(k)}{k}-2\Phi(-k)=\frac{\varepsilon}{1-\varepsilon}, \end{align*} $$

where $\phi = \Phi ^\prime $ is the probability density function of the standard normal distribution. This implies that the choice $k=1.345$ corresponds to a contamination fraction of $\varepsilon =5.8\%$ in Huber’s least favorable $\varepsilon $ -contaminated normal model, so the estimator is resistant to this level of outliers.
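Equation (4.52) can be inverted directly to recover $\varepsilon$ from $k=1.345$; a short numerical check:

```python
from math import erf, exp, pi, sqrt

def pdf(x):   # standard normal density phi(x)
    return exp(-x * x / 2) / sqrt(2 * pi)

def cdf(x):   # standard normal distribution function Phi(x)
    return 0.5 * (1 + erf(x / sqrt(2)))

def contamination_fraction(k):
    """Solve 2*phi(k)/k - 2*Phi(-k) = eps/(1-eps) for eps."""
    lhs = 2 * pdf(k) / k - 2 * cdf(-k)
    return lhs / (1 + lhs)

eps = contamination_fraction(1.345)   # approximately 0.058
```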

Appendix E Simulation study for Huber loss Sobel test

We conduct a new simulation to investigate the effect of the optimizer’s curse on the standard error used for the Sobel test, denoted by $\widehat {\mathrm {SE}}_{Sobel}$ . To achieve this, we consider various tuning constants as alternatives, specifying the number of alternatives as 6, 30, and 291. These numbers correspond to step lengths of 0.5, 0.1, and 0.01, respectively, within the tuning constants’ value range of $[0.1, 3]$ . Concentrating on the type I error rates, we set the sample size to 50, 200, 1,000, or 2,000. We also employ the same true values for the regression parameters and the same error distributions as those specified in Section 4. Over 1,000 simulated experiments, we then compute the mean standard error used for the Sobel test ( $\widehat {\mathrm {SE}}_{Sobel}$ ) for the Huber loss with the selected tuning constant (Huber-SEL) under the different alternatives, the Huber loss with the fixed tuning constant (Huber-FIX), and the Huber loss with the optimal tuning constant (Huber-OPT).

Table E shows the mean standard error used for the Sobel test ( $\widehat {\mathrm {SE}}_{Sobel}$ ) of the Huber-SEL, Huber-FIX, and Huber-OPT estimators under various designs. First of all, the simulation results reveal that the Huber-SEL estimator is indeed affected by the optimizer’s curse, yet this effect diminishes as the sample size increases. For example, consider the change in $\widehat {\mathrm {SE}}_{Sobel}$ across different numbers of alternatives. With $a=0,b=0.14$ , $N(0,1)$ , and $n=50$ , the $\widehat {\mathrm {SE}}_{Sobel}$ of the Huber-SEL estimator exhibits a gradual decline away from that of the Huber-OPT estimator, i.e., 31.50, with the values shifting from 27.90 to 27.66, and then to 26.95, as the number of alternatives increases. But when the sample size is 1,000 or 2,000, these values are all close to the optimal value. Secondly, we find that the Huber estimator with the fixed $k=1.345$ performs better than the Huber-SEL estimator for small sample sizes. However, as the sample size increases, the Huber-SEL estimator becomes closer to the Huber-OPT estimator than the Huber-FIX estimator. Combined with the conclusions drawn from Table B, two plausible explanations for the poor performance of the Huber loss Sobel tests at $n=50$ or $n=200$ are that there is a gap between the optimal tuning constant and the one determined by Algorithm 2, and that $\widehat {\mathrm {SE}}_{Sobel}$ is influenced by the optimizer’s curse. For practical applications, when the Huber-SEL estimator fails to yield satisfactory results, we suggest taking a moderate tuning constant, i.e., $k=1.345$ , as an alternative.

Table E Mean standard error ( $\times 10^3$ ) used for the Sobel test for the Huber-SEL, Huber-FIX and Huber-OPT estimators under various designs

Appendix F Measures of interesting variables in the PTD study

Table F Measures of interesting variables in the pathways to desistance study

References

Allison, P. D. (1995). The impact of random predictors on comparisons of coefficients between models: Comment on Clogg, Petkova, and Haritou. American Journal of Sociology, 100, 1294–1305.
Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51, 1173–1182.
Barron, J. T. (2019). A general and adaptive robust loss function. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4331–4339.
Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and applications. Cambridge University Press.
Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171–185.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. Chapman & Hall.
Field, A. P., & Wilcox, R. R. (2017). Robust statistical methods: A primer for clinical psychology and experimental psychopathology researchers. Behaviour Research and Therapy, 98, 19–38.
Freedman, L. S., & Schatzkin, A. (1992). Sample size for studying intermediate endpoints within intervention trials of observational studies. American Journal of Epidemiology, 136, 1148–1159.
Fritz, M. S., Taylor, A. B., & MacKinnon, D. P. (2012). Explanation of two anomalous results in statistical mediation analysis. Multivariate Behavioral Research, 47, 61–87.
Frölich, M., & Huber, M. (2017). Direct and indirect treatment effects: Causal chains and mediation analysis with instrumental variables. Journal of the Royal Statistical Society: Series B, 79, 1645–1666.
Green, P. J. (1984). Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society: Series B, 46, 149–192.
Hampel, F., Ronchetti, E., Rousseeuw, P., & Stahel, W. (1986). Robust statistics: The approach based on influence functions. Wiley.
Hansen, B. (2022). Econometrics. Princeton University Press.
Hayes, A. F. (2023). Introduction to mediation, moderation, and conditional process analysis: A regression-based approach (3rd ed.). Guilford Press.
Holland, P. W., & Welsch, R. E. (1977). Robust regression using iteratively reweighted least-squares. Communications in Statistics—Theory and Methods, 6, 813–827.
Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35, 73–101.
Huber, P. J., & Ronchetti, E. M. (2009). Robust statistics (2nd ed.). Wiley.
Kenny, D., Kashy, D., & Bolger, N. (1998). Data analysis in social psychology. In Gilbert, D. T., Fiske, S. T., & Lindzey, G. (Eds.), The handbook of social psychology (pp. 233–265). McGraw-Hill.
Kim, J., Harris, M. N., & Lee, Y. (2024). The relationships between mental health and violent offending among serious adolescent offenders: An examination of the mediating role of experienced and witnessed victimization. Crime & Delinquency, 70, 2622–2646.
Koller, M., & Stahel, W. A. (2011). Sharpening Wald-type inference in robust regression for small samples. Computational Statistics & Data Analysis, 55, 2504–2515.
Lachowicz, M. J., Preacher, K. J., & Kelley, K. (2018). A novel measure of effect size for mediation analysis. Psychological Methods, 23, 244–261.
Lockhart, G., MacKinnon, D. P., & Ohlrich, V. (2011). Mediation analysis in psychosomatic medicine research. Psychosomatic Medicine, 73, 29–43.
MacKinnon, D. P. (2008). Introduction to statistical mediation analysis. Taylor & Francis Group.
MacKinnon, D. P., Lockwood, C. M., & Williams, J. (2004). Confidence limits for the indirect effect: Distribution of the product and resampling methods. Multivariate Behavioral Research, 39, 99–128.
MacKinnon, D. P., Lockwood, C., & Hoffman, J. (1998). A new method to test for mediation. In The annual meeting of the Society for Prevention Research, Park City, USA.
MacKinnon, D. P., Lockwood, C. M., Hoffman, J. M., West, S. G., & Sheets, V. (2002). A comparison of methods to test mediation and other intervening variable effects. Psychological Methods, 7, 83–104.
MacKinnon, D. P., Warsi, G., & Dwyer, J. H. (1995). A simulation study of mediated effect measures. Multivariate Behavioral Research, 30, 41–62.
Maltagliati, S., Sarrazin, P., Isoard-Gautheur, S., Pelletier, L., Rocchi, M., & Cheval, B. (2023). Automaticity mediates the association between action planning and physical activity, especially when autonomous motivation is high. Psychology & Health, in press. https://doi.org/10.1080/08870446.2023.2188886
Meeker, W. Q., Cornwell, L. W., & Aroian, L. A. (1981). Selected tables in mathematical statistics (Vol. VII): The product of two normally distributed random variables. American Mathematical Society.
Meeker, W. Q., & Escobar, L. A. (1994). An algorithm to compute the cdf of the product of two normal random variables. Communications in Statistics—Simulation and Computation, 23, 271–280.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156–166.
Mulvey, E. P., Schubert, C. A., & Piquero, A. (2013). Pathways to desistance: Final technical report. Technical report, National Institute of Justice. URL: https://www.ojp.gov/library/publications/pathways-desistance-final-technical-report
Newland, R. P., Crnic, K. A., Cox, M. J., Mills-Koonce, W. R., & Investigators, F. L. P. K. (2013). The family model stress and maternal psychological symptoms: Mediated pathways from economic hardship to parenting. Journal of Family Psychology, 27, 96–105.
Ogden, C. L., Carroll, M. D., Curtin, L. R., Lamb, M. M., & Flegal, K. (2010). Prevalence of high body mass index in U.S. children and adolescents, 2007–2008. Journal of the American Medical Association, 303, 242–249.
Preacher, K. J. (2015). Advances in mediation analysis: A survey and synthesis of new developments. Annual Review of Psychology, 66, 825–852.
Preacher, K. J., & Hayes, A. F. (2008). Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models. Behavior Research Methods, 40, 879–891.
Preacher, K. J., & Selig, J. P. (2012). Advantages of Monte Carlo confidence intervals for indirect effects. Communication Methods and Measures, 6, 77–98.
Richiardi, L., Bellocco, R., & Zugna, D. (2013). Mediation analysis in epidemiology: Methods, interpretation and bias. International Journal of Epidemiology, 42, 1511–1519.
Rucker, D. D., Preacher, K. J., Tormala, Z. L., & Petty, R. E. (2011). Mediation analysis in social psychology: Current practices and new recommendations. Social and Personality Psychology Compass, 5, 359–371.
Schrader, R. M., & Hettmansperger, T. P. (1980). Robust analysis of variance based upon a likelihood ratio criterion. Biometrika, 67, 93–101.
Serfling, R. J. (2001). Approximation theorems of mathematical statistics. John Wiley & Sons.
Smith, J. E., & Winkler, R. L. (2006). The optimizer’s curse: Skepticism and postdecision surprise in decision analysis. Management Science, 52, 311–322.
Sobel, M. E. (1982). Asymptotic confidence intervals for indirect effects in structural equation models. Sociological Methodology, 13, 290–312.
Tofighi, D., & MacKinnon, D. (2011). RMediation: An R package for mediation analysis confidence intervals. Behavior Research Methods, 43, 692–700.
Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.
VanderWeele, T. J. (2015). Explanation in causal inference: Methods for mediation and interaction. Oxford University Press.
VanderWeele, T. J., & Tchetgen, E. J. T. (2017). Mediation analysis with time varying exposures and mediators. Journal of the Royal Statistical Society: Series B, 79, 917–938.
Wang, W. W., & Yu, P. (2023). Nonequivalence of two least-absolute-deviation estimators for mediation effect. TEST, 32, 370–387.
Wang, W. W., Yu, P., Lin, L., & Tong, T. (2019). Robust estimation of derivatives using locally weighted least absolute deviation regression. Journal of Machine Learning Research, 20, 1–49.
Wang, W. W., Yu, P., Zhou, Y., Tong, T., & Liu, Z. (2023). Equivalence of two least-squares estimators for mediation effect. Current Psychology, 42, 7364–7375.
Wang, Y. G., Lin, X., Zhu, M., & Bai, Z. (2007). Robust estimation using the Huber function with a data-dependent tuning constant. Journal of Computational and Graphical Statistics, 16, 468–481.
Yuan, Y., & MacKinnon, D. P. (2014). Robust mediation analysis based on median regression. Psychological Methods, 19, 1–20.
Zhao, Z., & Xiao, Z. (2014). Efficient regression via optimally combining quantile information. Econometric Theory, 30, 1272–1314.
Zou, H., & Yuan, M. (2008). Composite quantile regression and the oracle model selection theory. The Annals of Statistics, 36, 1108–1126.
Zu, J., & Yuan, Y. (2010). Local influence and robust procedures for mediation analysis. Multivariate Behavioral Research, 45, 1–44.
Figure 1 Causal diagram of the simple mediation model.

Figure 2 $\tau (k)$ is plotted for $0.9N(0, 1)+0.1N(0, 3^{2})$ (left) and $t_{1}$ (right). The corresponding red lines are $\tau (1.489)=1.296$ and $\tau (0.395)=2.278$, respectively.

Table 1 Optimal k and $\tau (k)$ for various error distributions and loss functions