1 Introduction
Generalized linear mixed models (GLMM) play a pivotal role in psychometric data analysis due to their ability to handle clustered data structures such as longitudinal data when the response variable follows a non-normal distribution, e.g., repeated measures of a binary item across subjects (Cnaan et al., Reference Cnaan, Laird and Slasor1997; Tuerlinckx et al., Reference Tuerlinckx, Rijmen, Verbeke and De Boeck2006). Despite their popularity, important drawbacks are sensitivity to model violations and computational issues. Flexibility in dealing with a complex data structure is coupled with a strong reliance on the underlying parametric assumptions. Additionally, this flexibility poses challenges for researchers in specifying the fixed and random components within the model; this sensitive issue has generated a lively discussion within the scientific community (see, for examples Barr et al., Reference Barr, Levy, Scheepers and Tily2013; Bates, Kliegl, et al., Reference Bates, Kliegl, Vasishth and Baayen2015; Matuschek et al., Reference Matuschek, Kliegl, Vasishth, Baayen and Bates2017).
As a prototypical example, in this paper, we make use of the dataset on follow-up reports of Italian children adopted from foreign countries presented in Santona et al. (Reference Santona, Tognasso, Miscioscia, Russo and Gorla2022). The dataset comprises
$527$
children. The follow-up questionnaires are administered to children at different times (depending on the regulations of the foreign countries) and range from
$4$
to
$6$
follow-ups, for a total of
$3116$
observations. The health status is measured through the analysis of several questions. The research aim is to explore the adaptation of adopted children to their new family and the social environment during the early stages of adoption, as well as how this adaptation evolves over time. In this analysis, we will focus on the presence or absence of health issues (i.e., a binary variable). The within-subject variable is then the month from the adoption, while other subject-specific socio-demographic variables (i.e., gender, age, country of origin) are also included in the model.
The fitting of a GLMM with a logistic link function and a binomial response reveals initial difficulties. One of them is that the default optimization method fails to converge, leading to unreliable model parameter estimates. Using alternative constrained optimization methods based on quadratic approximation is necessary by setting a high threshold for the number of times the function can be evaluated within the algorithm, which leads to a large increase in computational time and space complexity. In addition, as mentioned before, the specification of the GLMM is not so straightforward. The researcher must decide which variables are considered as random effects and which type of random effect must be inserted into the model. These decisions directly impact the correctness of statistical inference on the coefficients, and the proper choice between a parsimonious model (Bates, Kliegl, et al., Reference Bates, Kliegl, Vasishth and Baayen2015; Matuschek et al., Reference Matuschek, Kliegl, Vasishth, Baayen and Bates2017) and a maximal model (Barr et al., Reference Barr, Levy, Scheepers and Tily2013) is still an open problem.
In order to deal with these problems, this paper presents a robust approach for coping with inference on binary responses with a complex correlation structure, e.g., longitudinal data. The method can deal with any dependent variable following a generalized linear model (GLM) (Agresti, Reference Agresti2015) under the usual regularity condition (Azzalini, Reference Azzalini2017). As pointed out at the beginning of the introduction, problems related to the popular generalized linear mixed models are numerous and are summarized below.
First, generalized linear (mixed) models are frequently misspecified due to overdispersion, heteroscedasticity, or neglecting nuisance variables. In case of misspecification, traditional parametric tests (i.e., tests that exclusively depend on an assumed parametric model to compute the null distribution of the test statistic) may lose their validity as they rely on estimating the Fisher information under incorrect assumptions (Heagerty & Kurland, Reference Heagerty and Kurland2001). An alternative approach for testing regression coefficients in potentially misspecified GLMs is to employ a Wald-type test with a robust variance estimator (i.e., the sandwich estimator). However, for small to moderate sample sizes, sandwich estimates are rather inaccurate, leading to overly liberal tests (Hemerik et al., Reference Hemerik, Goeman and Finos2020). Moreover, existing quasi-likelihood methods (e.g., Generalized Estimating Equation (GEE), which can be defined as a generalization of the GLM to longitudinal data (Carlin et al., Reference Carlin, Wolfe, Coffey and Patton1999; Diggle et al., Reference Diggle, Liang and Zeger1994; Laird & Ware, Reference Laird and Ware1982; Liang & Zeger, Reference Liang and Zeger1986)) for testing in misspecified models often fail to provide satisfactory control over the type I error rate as underlined by Hemerik et al. (Reference Hemerik, Goeman and Finos2020). Even when well-specified, these tests control type I errors only asymptotically. In particular, GEE handles the dependence between observations by specifying a working correlation matrix. The choice of this matrix significantly affects regression estimator efficiency (Crowder, Reference Crowder1995; Sutradhar & Das, Reference Sutradhar and Das2000; Wang & Carey, Reference Wang and Carey2003). The working correlation is barely known in practice, and choosing the identity (i.e., assuming independence between the repeated measurements) is a common choice to have consistent estimators. GEE fails to control type I error in small samples, unbalanced designs, endogenous covariates when assuming independence, and not randomly missing values (Hemerik et al., Reference Hemerik, Goeman and Finos2020; Vonesh & Chinchilli, Reference Vonesh and Chinchilli1996; Wang & Carey, Reference Wang and Carey2003).
Second, the estimation process of generalized linear mixed models, having no closed-form solution for the likelihood function, can encounter convergence issues, especially in the presence of complex random effect structures. The main factors contributing to these problems include high dimensionality, non-identifiability, singular design matrices, imbalanced and sparse data, and high sensitivity to starting values, even if strong assumptions are made to simplify the estimation process. Again, these lead to invalid inferences, increased computational time, and resource use.
In summary, model misspecification and convergence issues are important concerns that can impact the reliability of inferences. The robustness of the significance tests depends not only on the degree of agreement between the specified mathematical model and the actual population data structure but also on the precision and robustness of the computational criteria for fitting the specified covariance structure to the data.
To address these challenges, nonparametric methods, such as permutation tests, prove useful for testing the effects of covariates. It is well known that permutation theory has emerged as a promising alternative to traditional, parametric statistical approaches, providing more robust and reliable inference for various hypothesis testing scenarios (Pesarin, Reference Pesarin2001). Kherad-Pajouh and Renaud (Reference Kherad-Pajouh and Renaud2010) and Lee and Braun (Reference Lee and Braun2012) proposed nonparametric tests for linear mixed models. The first is based on permuted residuals, while the second is based on the best linear unbiased predictors and restricted likelihood ratio test statistic. However, both rely on the Gaussian assumption distribution of the outcome. Instead, Basso and Finos (Reference Basso and Finos2012) and Finos and Basso (Reference Finos and Basso2014) proposed a permutation test for the generalized linear mixed model based on the parametric statistical test. Basso and Finos (Reference Basso and Finos2012) and Finos and Basso (Reference Finos and Basso2014) fit a model for each cluster/subject, and then the parameter of interest is estimated and tested considering the within-subject estimated coefficients as responses. Basso and Finos (Reference Basso and Finos2012) deal with testing within-subject effect while Finos and Basso (Reference Finos and Basso2014) the between-subject one. The test proposed by Basso and Finos (Reference Basso and Finos2012) is exact but has low power since the power of the statistical test depends on the quality of estimating the covariance of the cluster-level parameter estimator. In contrast, the statistical test presented by Finos and Basso (Reference Finos and Basso2014) is exact only in the case of balanced number of within-subject observations. In addition, the approach of Finos and Basso (Reference Finos and Basso2014) heavily relies on an adequate estimate of the Fisher Information.
This paper investigates challenges and novel methodologies for testing regression coefficients in generalized linear mixed models. We rely on the resampling-based procedure recently proposed by Hemerik et al. (Reference Hemerik, Goeman and Finos2020) and De Santis et al. (Reference De Santis, Goeman, Hemerik and Finos2024) to solve the above-mentioned problems. We extend their approach, dealing with the within-subject dependence that characterizes longitudinal data and the heteroscedasticity between groups of observations. The test proposed by Hemerik et al. (Reference Hemerik, Goeman and Finos2020) computes effective scores and randomly sign-flips each subject’s contribution to the score. The method is improved in De Santis et al. (Reference De Santis, Goeman, Hemerik and Finos2024) by including an additional standardization step. This leads to a test that has excellent small-sample performance and is asymptotically valid under any variance misspecification. The method proposed here exploits the “two-stage summary statistics” concept, which is well-known in several applied fields such as neuroimaging (Helwig, Reference Helwig2019; Mumford & Poldrack, Reference Mumford and Poldrack2007) and clinical trials (Everitt, Reference Everitt1995; Frison & Pocock, Reference Frison and Pocock1992). The works of Basso and Finos (Reference Basso and Finos2012) and Finos and Basso (Reference Finos and Basso2014) mentioned before also involve using summary measures at the subject/cluster level. However, these two methods, as well as other permutation-based statistical tests (i.e., Kherad-Pajouh and Renaud, Reference Kherad-Pajouh and Renaud2010; Lee and Braun, Reference Lee and Braun2012), cannot deal with heteroscedasticity and unbalanced data. In addition, most of these methods depend on some parameter estimation (e.g., Fisher Information in Finos and Basso, Reference Finos and Basso2014) or specification of the random part of the model. In contrast, the approach proposed is free from assumptions concerning the random data structure.
The article is organized as follows. Section 2 outlines the sign-flipping test proposed by De Santis et al. (Reference De Santis, Goeman, Hemerik and Finos2024). This approach is extended to clustered data in Section 3. Section 4 evaluates by simulations the performance of our method in comparison with the GLM, GLMM, and GEE testing in terms of type I error control and power. Finally, Section 5 presents our case study, highlighting how, thanks to the proposed approach, it is possible to provide robust inference for binary longitudinal data.
The method proposed is efficiently implemented in the R package jointest (Finos & Andreella, Reference Finos and Andreella2024), which is compatible with large datasets and complex statistical models. Along the manuscript, we will call the approach developed by Hemerik et al. (Reference Hemerik, Goeman and Finos2020) and improved by De Santis et al. (Reference De Santis, Goeman, Hemerik and Finos2024) the flipscores approach, while the proposed extension to deal with dependent observations as flip2sss (i.e., flipscores two-stage summary statistics). We also clarify here that when we use the term score, we refer to the score function derived from the likelihood of GLMs where only the fixed components are assumed. This should not be confused with scores obtained from item theory models.
2 Sign-flipping score test with independent observations
We introduce the flipscores method proposed by De Santis et al. (Reference De Santis, Goeman, Hemerik and Finos2024) for inference on regression coefficients in GLM, which improves Hemerik et al. (Reference Hemerik, Goeman and Finos2020)’s approach, particularly in the case of small sample size. The statistical test is based on randomly flipping the signs of scores (i.e., one score for each subject), which is robust to misspecification of the variance in the GLM framework. The notation used throughout the manuscript follows that used by De Santis et al. (Reference De Santis, Goeman, Hemerik and Finos2024).
Consider n independent observations
$y_1, \dots , y_n$
following a GLM from the exponential dispersion family (Agresti, Reference Agresti2015):

where
$\theta _i$
is the canonical parameter,
$\phi _i$
the dispersion one, and
$\mu _i = \mathrm {E}(y_i) = b^\prime (\theta _i)$
. The generalized linear model is then defined as:

where
$g(\cdot )$
is the link function,
$\boldsymbol {\mu } = (\mu _1, \dots , \mu _n)^\top $
,
$\boldsymbol {X}$
is the design matrices corresponding to the vector of q parameters
$\boldsymbol {\beta }$
.
The main focus is to test the null hypothesis for a given element d of
$\boldsymbol {\beta }$
still accounting for all nuisances parameters, i.e.,
$H_0: \beta _d = \beta _0 \mid \beta _1,\ldots , \beta _{d-1}, \beta _{d+1},\beta _{q}, \phi _1, \dots , \phi _n$
. The case of one parameter of interest, i.e., one column of matrix
$\boldsymbol {X}$
, is then considered. However, the approach is extendable to the multivariate case (De Santis et al., Reference De Santis, Goeman, Hemerik and Finos2024; Hemerik et al., Reference Hemerik, Goeman and Finos2020). The test statistic used in Hemerik et al. (Reference Hemerik, Goeman and Finos2020) (with a notation slightly adapted to our context) is the effective score:

where
$\boldsymbol {V} = \mathrm {diag}\{\mathrm {var}(y_i)\}$
,
$\boldsymbol {W} = \mathrm {diag}\left \{ \dfrac {\partial \mu _i}{\partial \eta _i} \right \} V^{-1} \mathrm {diag}\left \{ \dfrac {\partial \mu _i}{\partial \eta _i} \right \}$
,
$\boldsymbol {H} = \boldsymbol {W}^{1/2} \boldsymbol {X}_{-d} (\boldsymbol {X}_{-d}^\top \boldsymbol {W} \boldsymbol {X}_{-d})^{-1} \boldsymbol {X}_{-d}^\top \boldsymbol {W}^{1/2}$
and
$\hat {\boldsymbol {\mu }} = (\hat {\mu }_1, \dots , \hat {\mu }_n)$
are the fitted values of the model under the null hypothesis.
$\boldsymbol {X}_{-d}$
indicates the matrix
$\boldsymbol {X}$
without the column d and
$\boldsymbol {X}_{d}$
the one containing only the column d. Since S is a score test,
$\boldsymbol {V}$
and
$\boldsymbol {W}$
are computed under the null hypothesis (i.e., assuming
$\beta _d=\beta _0$
), furthermore being
$\boldsymbol {W}$
a diagonal matrix, its square root is the element-wise square root.
To improve the small-sample reliability of the test in Hemerik et al. (Reference Hemerik, Goeman and Finos2020), De Santis et al. (Reference De Santis, Goeman, Hemerik and Finos2024) proposed the standardized version of (1), i.e.,
$S^\star = \dfrac {S}{\mathrm {var}\{S\}^{1/2}}$
where
$\text {var}\{S\}= n^{-1} \boldsymbol {X}_d^\top \boldsymbol {W}^{1/2} (\boldsymbol {I} - \boldsymbol {H}) \boldsymbol {W}^{1/2} \boldsymbol {X}_d + o_p(1)$
.
$S^\star $
results to be asymptotically valid under any variance misspecification. It is important to remark that the tests by Hemerik et al. (Reference Hemerik, Goeman and Finos2020) and De Santis et al. (Reference De Santis, Goeman, Hemerik and Finos2024) are asymptotically equivalent. In particular, they share the same robustness to variance misspecification. The test by De Santis et al. (Reference De Santis, Goeman, Hemerik and Finos2024) is, however, better for small sample size since it has better “level accuracy” (i.e., the level of the test tends to be closer to
$\alpha $
).
The statistic S in Equation (1) can be rephrased as a sum of n components, i.e., the effective score contributions. The associated p-value is computed by randomly flipping the signs of these score contributions. We write each sign-flipping transformation as an
$n \times n$
diagonal matrix
$\boldsymbol {F}$
, with diagonal elements that are
$-1$
or
$1$
with equal probability. Sign-flipping the score contribution means multiplying the effective score by a given sign-flip diagonal matrix
$\boldsymbol {F}$
, i.e.:

The corresponding standardized version (De Santis et al., Reference De Santis, Goeman, Hemerik and Finos2024) equals

where
$\text {var}\{S(\boldsymbol {F})\}= n^{-1} \boldsymbol {X}_d^\top \boldsymbol {W}^{1/2} (\boldsymbol {I} - \boldsymbol {H}) \boldsymbol {F} (\boldsymbol {I}-\boldsymbol {H}) \boldsymbol {F} (\boldsymbol {I}-\boldsymbol {H}) \boldsymbol {W}^{1/2} \boldsymbol {X}_d + o_p(1)$
.
Considering B independent sign flip transformations, where
$S_1^\star = S^\star (\boldsymbol {I})$
is the observed test statistic, we reject the null hypothesis
$H_0: \beta _d = \beta _0$
versus the alternative
$H_1: \beta _d> \beta _0$
at significance level
$\alpha $
if:

where
$S_{(1)}^\star \le S_{(2)}^\star \le \cdots \le S_{(B)}^\star $
are the sorted statistics and
$\lceil \cdot \rceil $
is the ceiling function. In the same way, we reject
$H_0$
versus
$H_1: \beta _d < \beta _0$
if
$S_1^\star < S_{(\lceil \alpha B \rceil )}^\star $
and versus
$H_1: \beta _d \ne \beta _0$
if
$S_1^\star < S_{(\lceil (\alpha /2) B \rceil )}^\star \cup S_1^\star> S_{(\lceil (1-\alpha /2) B \rceil )}^\star $
.
As said above, the method is easily extendable to nonscalar d (De Santis et al., Reference De Santis, Goeman, Hemerik and Finos2024; Hemerik et al., Reference Hemerik, Goeman and Finos2020). In short,
$\boldsymbol {X}_d$
becomes a matrix instead of a vector, the effective score test statistic
$S(\boldsymbol {F})$
defined in Equation (2) becomes a vector, and the
$\text {var}\{S(\boldsymbol {F})\}$
in Equation (3) is now a matrix. The test statistic for
$H_0: \boldsymbol {\beta } = \boldsymbol {\beta }_0$
, with
$\text {dim}(\boldsymbol {\beta })> 1$
, equals:

where
$\boldsymbol {M}$
is generally defined as the inverse of an estimate of the effective Fisher information of
$\boldsymbol {\beta }$
, i.e., Mahalanobis distance (Pesarin, Reference Pesarin2001). Note that we do not need to estimate the Fisher Information correctly due to the robustness of our method against variance misspecification.
Having, therefore, the null distribution of the score-based test statistic
$S^\star $
, the following section shows how to deal with the case of correlated longitudinal data. The proposed solution will heavily rely on the robustness of the standardized flipscores to variance misspecification since this will allow us to avoid modeling the random part of the mixed model.
3 Extension to the non-independent case
This section extends the test from Section 2 to the case where the dependent variable has a clustered structure, e.g., longitudinal data. To make the proposed solution more concrete, we briefly introduce the part of the data that will be analyzed in Section 5. Adopted children are measured longitudinally over Time, i.e., in different time points and different number of occasions. Among others, health issues are recorded; this is the response variable Unhealth (binary variable). Other variables in the model, such as Sex (binary variable), Age (of the child arrived in the family, in years), and Country (of origin), are constant within the subjects and play the role of between-cluster variables. A possible model of interest is the following: Unhealth
$\mathtt {\sim }$
1+ Country + Age + Sex + Time + Sex:Time.
Consider n observations
$y_1, \dots , y_n$
with a correlation structure dictated by the longitudinal data nature. We have then
$n_j$
observations in cluster (i.e., child) j with
$n = \sum _{j}^{N} n_j$
and N is the total number of clusters (i.e., children). We then assume that the observations within the clusters are dependent (e.g., repeated measurements for each subject). The exchangeability assumption used to compute the null distribution of the standardized score test statistic
$S^\star $
defined in Equation (3) does not hold.
The extension to the non-independent observations case proposed is based on the well-known “two-stage summary statistics” approach (Everitt, Reference Everitt1995; Frison & Pocock, Reference Frison and Pocock1992; Helwig, Reference Helwig2019; Mumford & Poldrack, Reference Mumford and Poldrack2007). In short, the “two-stage summary statistics” approach reduces the hierarchical complexity of data as longitudinal ones by computing summary measures at the cluster level in the first stage, which are then analyzed as a response random variable in the second stage. Therefore, the main assumption is to have (at least asymptotically) unbiased estimators of these summary measures.
In the first stage, a generalized linear model can be fitted separately for each subject
$j \in \{1,\dots , N\}$
including only the h covariates
$\boldsymbol {K}_{ij}\in \mathbb {R}^{1 \times h}$
with
$i \in \{1, \dots , n_j\}$
that vary within-cluster j, i.e.,

where
$\boldsymbol {\tau }_j\in \mathbb {R}^{h}$
the vector of h parameters of the cluster j,
$g(\cdot )$
is the link function and
$\mathbb {E}(y_{ij}) = \mu _{ij}$
with
$y_{1j}, \dots , y_{n_{j}j}$
following a distribution from the exponential dispersion family (Agresti, Reference Agresti2015)
$\forall j \in \{1,\dots , N\}$
. Fitting the model defined in Equation (5) leads to a vector of estimated parameters
$\hat {\boldsymbol {\tau }}_j \in \mathbb {R}^{h}$
for each cluster j corresponding to the design matrix
$\boldsymbol {K}_j$
that vary within-subject j. Regarding our example, a natural choice would be to model each subject with logistic regression by Unhealth
$\mathtt {\sim }$
1 + Time, that is,
$\hat {\boldsymbol {\tau }}_j$
would be the vector containing the estimated intercept and slope for each child j, i.e., our “summary statistics”.
In the second stage, the N vectors
$\hat {\boldsymbol {\tau }}_j$
of estimated coefficients of each cluster j is collected in a
$N\times h$
matrix
$\mathbf {T}$
and modeled by a linear model having the between-cluster variables as predictors:

where
$\boldsymbol {X} \in \mathbb {R}^{N\times l}$
is the matrix of between-cluster variables (i.e., Intercept, Country, Age and Sex in our example),
$\boldsymbol {\beta } \in \mathbb {R}^{l\times h}$
is the matrix of predictors where each column relates to a different column of
$\mathbf {T}$
, and
$\mathbf {E} \in \mathbb {R}^{N\times h}$
is the matrix of errors.
Consider again the model of interest presented in Section 5, i.e., Unhealth
$\mathtt {\sim }$
1+ Country + Age + Sex + Time + Sex:Time. The intercept of within-cluster effects (i.e., the first column of
$\mathbf {T}$
) is modeled by the between-cluster effects Intercept, Country, Age and Sex. Instead, the estimated slope related to Time (i.e., the second column of
$\mathbf {T}$
) is modeled only by Intercept and Sex covariates. Therefore, in the second column of
$\mathbf {T}$
, only those coefficients of Intercept and Sex covariates will be estimated, while the ones associated with Country and Age are set to
$0$
.
In summary, the problem is now rewritten as a multiple multivariate linear model where each observation is
$\hat {\boldsymbol {\tau }}_j \sim (\mathbf {x}_j \boldsymbol {\beta }, \boldsymbol {\psi }_j)$
with
$\mathbf {x}_j$
is the j row of
$\boldsymbol {X}$
from Equation (6). The variance of the error terms
$\boldsymbol {\psi }_j$
is the variance of the independent rows of the error matrix
$\mathbf {E}$
defined in Equation (6), i.e.,
$\boldsymbol {\psi }_j = \text {var}(\boldsymbol {\epsilon }_j)$
where
$\mathbf {E} = [ \boldsymbol {\epsilon }_1^\top , \dots , \boldsymbol {\epsilon }_N^\top ]^\top $
. In particular,
$\boldsymbol {\psi }_j$
can be decomposed by two sources of variability: the variance of the random effects (named
$\boldsymbol {\Sigma }$
) and the conditional variance of
$\hat {\boldsymbol {\tau }}_j$
(named
$\boldsymbol {\Sigma }_j$
), i.e.,
$\boldsymbol {\psi }_j = \boldsymbol {\Sigma } + \boldsymbol {\Sigma }_j$
. The first captures the between-subject variability due to the presence of random effects, while the second represents the within-cluster variability.
Outside the fully balanced designs with homoscedastic errors, the
$\boldsymbol {\psi }_j$
varies among clusters, making the standard linear model tools unreliable. The literature on “two-stage summary statistics” focuses on the estimate of these quantities (Beckmann et al., Reference Beckmann, Jenkinson and Smith2003; Finos & Basso, Reference Finos and Basso2014; Frison & Pocock, Reference Frison and Pocock1992; Helwig, Reference Helwig2019; Mumford & Poldrack, Reference Mumford and Poldrack2007) hence making assumptions on the data and being constrained to a precise specification of the random quantities in the model. In our data example, the researcher employing standard “two-stage summary statistics” approaches should choose whether intercept and slope are random coefficients and also among independent or correlated random effects–decisions that are not required with the approach we propose here.
The robustness of the standardized flipscores approach to variance misspecification results in an ideal tool to make inference on the models used in the second stage defined in Equation (6). The null hypothesis
$H_0: \beta = \beta _0$
is then tested with the standardized score test defined in Equation (4), applying the same sign-flipping transformation across all h models, still avoiding the complexities associated with formulating the random component of the model. In fact, the debate about how to formulate the random part inside the GLMM framework, i.e., choosing between a parsimonious (Bates, Kliegl, et al., Reference Bates, Kliegl, Vasishth and Baayen2015; Matuschek et al., Reference Matuschek, Kliegl, Vasishth, Baayen and Bates2017) or “maximal” (Barr et al., Reference Barr, Levy, Scheepers and Tily2013) model, is open.
A further remark relates to fitting separated models for each cluster. Actually, any summary measures at the cluster level can be used in the second stage; the only required property is the (asymptotic) unbiasedness of these summary measures. As an example, in the application of Section 5 and in the simulations of Section 4, we will adopt and evaluate the use of the Firth correction (Firth, Reference Firth1995) in the fitting of the cluster-level binomial models, i.e., Equation (5); this helps in reducing the finite-sample bias of the maximum likelihood estimates. Although the unbiasedness property holds, the selection of summary measures is guided by the researcher’s specific objectives. For instance, in clinical trials, common choices include post-treatment means or mean changes relative to baseline (Frison & Pocock, Reference Frison and Pocock1992). In our example, we focus on the effect of time on health conditions.
4 Simulations
This section explores the proposed approach and compares it with the GLM, GLMM, and GEE. The comparison is in terms of type I error control and power. The dependent variable
$\boldsymbol {y}$
is simulated as a Bernoulli random variable with the following mean:

where
$1 \le j \le N$
refers to cluster j, which contains
$n_j$
observations, so that
$n = \sum _{j = 1}^{N} n_j$
. The
$\boldsymbol {X}$
and
$\boldsymbol {Z}$
matrices are the design matrices of the tested and nuisance parameters
$\beta $
and
$\gamma $
.
$\boldsymbol {X}$
and
$\boldsymbol {Z}$
are correlated and simulated as
$X_{ij} = O_{ij} + Z_j/2$
where
$O_{ij} \sim \mathcal {N}(0,1)$
and
$Z_{j} \sim \mathcal {N}(0,1)$
. Finally, the matrix
$\mathbf {U}$
defines the subject-specific random effect (i.e., random intercept) and
$\mathbf {D}$
the subject-specific random slope of
$\boldsymbol {X}$
.
Since the proposed approach can deal with different types of correlation structures, in these simulations, we consider the case of full random effects (i.e.,
$U_j$
and
$D_j$
present). We decided to follow Equation (7) to have simulations reflecting the scenario of the follow-up application proposed in Section 5. The ones considering a subject-specific nuisance covariate with and without the presence of an additional random effect for
$Z_{ij}$
are placed in Appendix A. The aim is to test
$H_0: \beta = 0$
under this correlation structure. We take
$\beta = 0$
to evaluate type I error control and
$\beta = 2$
to analyze the power of the approaches. The nuisance parameter is set to
$\gamma = 2$
. The subject-specific random effects
$U_j$
and
$D_j$
are simulated from an uncorrelated bivariate normal distribution with mean
$0$
and standard deviation
$0.5$
. In both scenarios,
$1000$
simulations are performed.
The flip2sss approach is proposed in two versions. The first one (called in the simulations’ results simply “flip2sss”) uses the maximum likelihood estimator of the classical generalized linear model in the first stage to estimate the summary statistics. The second one (called in the simulations’ results “flip2sss Firth”) utilized instead the modified score equations proposed by Firth (Reference Firth1995) applied to the logistic model (Heinze & Schemper, Reference Heinze and Schemper2002; Kosmidis & Firth, Reference Kosmidis and Firth2009; Puhr et al., Reference Puhr, Heinze, Nold, Lusa and Geroldinger2017) in the first stage. In fact, the main assumption of the method proposed is having unbiased (at least asymptotically) estimates of the parameter at the first stage defined in Equation (5). The bias reduction approach proposed, i.e., the Firth correction, permits dealing with the problem of “perfect separation” (Albert & Anderson, Reference Albert and Anderson1984) also called “monotone likelihood” (Bryson & Johnson, Reference Bryson and Johnson1981) when we estimate the summary statistics in the first stage. However, several bias reduction techniques are present in the literature (e.g., Cordeiro and McCullagh, Reference Cordeiro and McCullagh1991; Kenne Pagui et al., Reference Kenne Pagui, Salvan and Sartori2017; Kosmidis and Firth, Reference Kosmidis and Firth2021) that can be used instead of the Firth correction.
We then fit two GLMMs: The first one is the correctly specified (called in the simulations’ results simply “GLMM”), i.e., considering both a random intercept and a random slope in the model specification since
$\mu _{ij}$
depends on
$U_j$
and
$D_j$
in Equation (7). The second one is misspecified (called in the simulations’ results “GLMM misspecified”), i.e., we define only the random intercept in the model specification while
$D_j$
in Equation (7) is simulated.
The GEE fitted here refers to assuming the independent working correlation matrix. This choice is justified since the consistency of GEEs is not assured in the case of using an arbitrary working correlation matrix when the covariates
$X_{ij}$
can vary within a subject over time (Sullivan Pepe & Anderson, Reference Sullivan Pepe and Anderson1994). The independent working correlation matrix is typically a safe choice even though the independence assumption is not necessarily valid.
Although these simulation analyses compare the performance of different models in terms of type I error control and power, we must remark that the models defined above have distinct inferential objectives. Specifically, the GLMM provides conditional estimates, while the GEE yields marginal ones. Consequently, the interpretation of the estimated parameters is not directly comparable (Gardiner et al., Reference Gardiner, Luo and Roman2009).
In addition to the GEE and the two GLMMs described above, we also fit a standard GLM with a logit link function.
Figure 1 shows the estimated error rate considering
$n_j \in \{5,10,25,50\} \forall j \in \{1, \dots , N\}$
, and
$N \in \{10,50, 100\}$
, while Figure 2 presents only values of the estimated error between
$0$
and
$0.1$
to have a clearer vision about the difference on controlling type I error between the methods. First, we can note that the misspecified GLMM and the basic GLM do not control the type I error for small to large sample sizes. The correctly specified GLMM controls the type I error in the case of
$N \in \{50,100\}$
, while GEE fails in most of the scenarios considered. In contrast, the proposed flip2sss approach with Firth correction maintains type I error control in all scenarios, while we are conservative if the Firth correction is not used in the first stage. Across all the simulations, the correctly specified GLMM encounters a convergence problem
$0.28\%$
of the time and a singularity issue
$38.71\%$
of the time.

Figure 1 Estimated type I error considering
$N \in \{10,50,100\}$
number of subjects and
$n_j \in \{5,10,25,50\}$
repeated measurements. Each line represents one model, and the grey area around the dashed black line represents the
$0.95$
confidence bound for
$\alpha = 0.05$
. The term “GLMM” refers to the GLMM with random intercept and random slope specified, whereas “misspecified GLMM” refers to the GLMM with solely the random intercept included in the model specification. The term “flip2sss” refers to the results using the maximum likelihood estimator in the first stage, while the term “flip2sss Firth” refers to the results using the bias correction proposed by Firth (Reference Firth1995). Finally, the terms GLM and GEE refer to a GLM with a logit link function and a GEE with an independent working correlation matrix, respectively.

Figure 2 Zoom of Figure 1 considering the estimated type I error
$\in [0,0.1]$
. Estimated type I error considering
$N \in \{10,50,100\}$
number of subjects and
$n_j \in \{5,10,25,50\}$
repeated measurements. Each line represents one model, and the grey area around the dashed black line represents the
$0.95$
confidence bound for
$\alpha = 0.05$
. The term “GLMM” refers to the GLMM with random intercept and random slope specified, whereas “misspecified GLMM” refers to the GLMM with solely the random intercept included in the model specification. The term “flip2sss” refers to the results using the maximum likelihood estimator in the first stage, while the term “flip2sss Firth” refers to the results using the bias correction proposed by Firth (Reference Firth1995). Finally, the terms GLM and GEE refer to a GLM with a logit link function and a GEE with an independent working correlation matrix, respectively.
Figure 3 shows the estimated power under the same settings of Figure 1 for
$\beta = 2$
. In this case, we only show the results for the two versions of flip2sss and correctly specified GLMM since the GLM, GEE, and misspecified GLMM did not control the type I error in any scenario. We see that the flip2sss with Firth correction maintains high power as the correctly specified GLMM in most of the scenarios.

Figure 3 Estimated power considering
$N \in \{10,50,100\}$
number of subjects and
$n_j \in \{5,10,25,50\}$
repeated measurements. Each line represents one model with corresponding colored confidence bounds at
$0.95$
level. The term “flip2sss” refers to the results using the maximum likelihood estimator in the first stage, while the term “flip2sss Firth” refers to the results using the bias correction proposed by Firth (Reference Firth1995).
On average, full separation occurs
$25\%$
of the time when
$n_j \in \{5,10\}$
and
$\beta = 0$
. In contrast, full separation occurs
$83\%$
of the time for
$n_j = 5$
and
$43\%$
of the time for
$n_j = 10$
when
$\beta = 2$
. These percentages decrease as
$n_j$
increases. In addition, on average, we observe only
$0$
s in
$\approx 20\%$
of cases when
$n_j = 5$
and
$\approx 10\%$
of cases when
$n_j = 10$
regardless of whether
$\beta \in \{0,2\}$
.
Figure 4 shows the estimated type I error in the case of unbalanced data. Data are simulated following again Equation (7) with
$n_j \sim \text {Pois}(\lambda )$
where
$\lambda \in \{5, 10, 25, 50\}$
. We note the ability of the flip2sss approach proposed to handle unbalanced data in contrast to GEE. The method proposed is conservative in the case of a small number of repeated measurements if we do not use the Firth correction in the first stage. At the same time, the performance of GLMM strongly depends on correctly specifying the random part of the model, and the type I error is not controlled in the case of a small number of subjects. When the correctly specified GLMM is fitted, in
$33\%$
of the simulations, a singularity problem is present, while
$\approx 0.2\%$
of the times is a convergence or identifiability issue.

Figure 4 Estimated type I error considering
$N \in \{10,50,100\}$
number of subjects and
$n_j \sim \text {Pois}(\lambda )$
where
$\lambda \in \{5, 10, 25, 50\}$
repeated measurements, i.e., in the case of unbalanced data. Each line represents one model, and the grey area around the dashed black line represents the
$0.95$
confidence bound for
$\alpha = 0.05$
. The term “GLMM” refers to the GLMM with random intercept and random slope specified, whereas “misspecified GLMM” refers to the GLMM with solely the random intercept included in the model specification. The term “flip2sss” refers to the results using the classic GLM in the first stage, while the term “flip2sss Firth” refers to the results using the bias correction proposed by Firth (Reference Firth1995). Finally, the terms GLM and GEE refer to a GLM with a logit link function and a GEE with an independent working correlation matrix, respectively.
Finally, Figure 5 shows the estimated power under the same settings of Figure 4 fixing
$\beta = 2$
. We can note the low power of flip2sss approach without the Firth correction in the case of small sample sizes. However, using the Firth correction, the estimated power reaches the level of the correctly specified GLMM, controlling the type I error at the same time. The results for GEE, as well as the ones for GLM and the misspecified GLMM, are not reported since it does not control the type I error in any scenarios of Figure 4.

Figure 5 Estimated power considering
$N \in \{10,50,100\}$
number of subjects and
$n_j \sim \text {Pois}(\lambda )$
where
$\lambda \in \{5, 10, 25, 50\}$
repeated measurements, i.e., in the case of unbalanced data. Each line represents one model with corresponding colored confidence bounds at
$0.95$
level. The term “flip2sss” refers to the results using the classic GLM in the first stage, while the term “flip2sss Firth” refers to the results using the bias correction proposed by Firth (Reference Firth1995).
On average, complete separation occurs
$72\%$
of the time for
$\lambda =5$
and
$38\%$
for
$\lambda =10$
when
$\beta = 0$
. If
$\beta =2$
,
$76\%$
of the cases encounter a complete separation problem when
$\lambda = 5$
, while
$48\%$
if
$\lambda = 10$
. In addition, we observe only
$0$
s approximately
$12\%$
(
$15\%$
) of the times on average when
$\lambda =5$
and
$4\%$
(
$6\%$
) of the time when
$\lambda =10$
and
$\beta $
equals
$0$
(
$2$
).
We can conclude from these simulations that the proposed method controlled the type I error for both small and large samples and for both few and many repeated measurements in the case of balanced and unbalanced data, unlike the two competitive methods considered, namely GEE and GLMM. In addition, a downside of the GLMM is that it requires the user to correctly specify the mixed model, which is often a problematic task, as mentioned above (Barr et al., Reference Barr, Levy, Scheepers and Tily2013; Bates, Kliegl, et al., Reference Bates, Kliegl, Vasishth and Baayen2015; Matuschek et al., Reference Matuschek, Kliegl, Vasishth, Baayen and Bates2017).
5 Application
Adoption is a legal process in which an individual or a couple assumes permanent parental responsibility for a child who is not biologically related to them. The complexity of the adoption journey varies depending on factors such as the child’s age, the circumstances surrounding the adoption, and the child’s prior experiences. The dataset analyzed comes from a study of Santona et al. (Reference Santona, Tognasso, Miscioscia, Russo and Gorla2022), which is based on responses given by adoptive parents to reports called follow-ups. Follow-ups are periodic reports that provide information on the progress of the adoption, the child’s adjustment to the new environment, and their psychological and physical well-being. The Authorized Entity (i.e., an Italian institution that manages the adoption procedures), following the return of the adoptive family to Italy with the child, must carry out mandatory follow-ups required of adoptive families by the country of origin, which must then be transmitted to the foreign authority.
The sample analyzed consists of
$534$
minors adopted between
$2008$
and
$2017$
through the Italian Childhood Aid Center (CIAI). The minors range in age from
$0.8$
to
$12$
years upon arrival in Italy and come from different countries such as India, China, Burkina Faso, Colombia, Ethiopia, Thailand, Vietnam, and Cambodia. To study the trend over time of children’s physical development and health status detected at each follow-up, a new variable called Unhealth was created based on the following answers:
-
1. Physical development (regular versus irregular)
-
2. Psychomotor development (regular versus irregular)
-
3. Sleep-wake rhythm (regular versus irregular)
-
4. Nutrition (regular versus irregular)
-
5. Diseases (pathologies and/or developmental alterations versus none)
which belong to the life and medical areas of the follow-up report. The new variable, Unhealth, is defined as a binary variable where
$1$
means that at least one of the above health problems is present and
$0$
means none are present. In addition to the questions described above, we have various socio-demographic information for each child described in Table 1.
Table 1 Description of the socio-demographic variables.

The health status is recorded multiple times for each child; the aim is to understand the development of this variable across time, taking into account the evident within-child variability. Covariates such as Age (median
$5$
year, range: between
$1$
month and
$12$
years), Country (
$8$
countries, proportions ranging from
$1.2\%$
to
$47.2\%$
), and Sex (female
$42.0\%$
) defined in Table 1 are considered in the model as moderator variables. Figure 6 shows the relation between country and time. As is visible from the plot, the number of follow-ups and the time from the adoption vary among countries. This is due to differing regulations.
Here, we apply our proposed method presented in Section 3 and two other competitive approaches common in the literature, i.e., GEE and GLMM. The packages jointest (Finos & Andreella, Reference Finos and Andreella2024), geepack (Højsgaard et al., Reference Højsgaard, Halekoh and Yan2006), and lme4 (Bates, Mächler, et al., Reference Bates, Mächler, Bolker and Walker2015) are used respectively. We utilize the car package (Fox & Weisberg, Reference Fox and Weisberg2019) to compute the ANOVA test for the generalized linear mixed models. The R command used for performing the analysis is described in Code 1 for the proposed approach and Code 2 in Appendix B for the GEE and GLMM methods.

The random effects are directly managed by specifying the variable that identifies the cluster (e.g., subject) in the cluster and the subject-specific variable in the summstats_within arguments of the flip2sss function. Therefore, as mentioned in Section 3, the problem of defining the correlation structure related to the GEE and GLMM models is solved in a simple way. Specifically, we use the Firth correction (Firth, Reference Firth1995) based on the brglm2 R package (Kosmidis, Reference Kosmidis2023) for fitting the model in the first stage specified in the summstats_within argument. The combine_contrasts function computes the global test for each covariate at the subject level defined in the formula argument of the flip2sss function, i.e., Age, Country and Sex having as dependent variable the estimated intercept and slope specified in the summstats_within argument of the flip2sss function. The Mahalanobis distance (used by default) is used as the combining function (Pesarin, Reference Pesarin2001) specified in the comb_func argument of the combine_contrasts function and defined in Equation (4).
We start our analysis by estimating the GLMM with two random effects: a random slope defined by the survey time and a random intercept defined by the subject variable. The fixed effects part is specified in the same way as defined in the right part of the formula object in Code 1. Using the default optimizer, i.e., a two-stage optimization, where the Bound Optimization BY Quadratic Approximation (BOBYQA) iterative algorithm (Powell et al., Reference Powell2009) is used in the first stage and the Nelder-Mead approach (Lagarias et al., Reference Lagarias, Reeds, Wright and Wright1998) in the second one, the estimation process does not converge. This was solved by specifying the BOBYQA optimizer for both stages, increasing the maximum number of function evaluations to
$200000$
, leading to a high computational time (i.e.,
$\approx 4$
minutes). In contrast, GEE and flip2sss reach a computational time equal to
$0.3$
and
$10$
seconds, respectively.

Figure 6 Sample proportion of children with an unhealthy status across time for each country. The colored bars indicate a
$\approx 0.3$
confidence interval for each time point/country combination. If bars are absent, it indicates that the variance of Unhealth equals
$0$
.
Table 2 shows the results of the flip2sss analysis. The ANOVA-like test performed by the combine_contrasts function detects significant effects for all variables included in the model. More specifically, the time from the arrival in the new family and the age of arrival decrease the probability of health issues, while males are more prone to them.
Table 2 flip2sss model results: summary of the model and related Anova-like table.

Results of the GLMM and GEE methods are reported in Tables 3 and 4, respectively. In the GLMM case, we consider both intercept and random slopes; the results assuming only a random intercept are placed in Appendix B. In flip2sss, the statistical test related to Ethiopia is the significant one, while in GEE, Cambodia and India are the only nonsignificant coefficients. This is also true for the GLMM with the intercept as random effect only (see Appendix B), while by modeling both intercept and slopes as random, the nonsignificant ones are China, Cambodia, and India (see Table 3). This again emphasizes the problem that the choice of the random part’s structure affects the results. This problem is bypassed by adopting the proposed flip2sss approach.
Table 3 GLMM model results: summary of the model and related analysis of variance table.

Table 4 GEE results: summary of the model and related analysis of variance table.

6 Discussion
The extension of the flipscores (De Santis et al., Reference De Santis, Goeman, Hemerik and Finos2024; Hemerik et al., Reference Hemerik, Goeman and Finos2020) method proposed here (i.e., flip2sss) accounts for heteroscedasticity between- and within-subject dependence, which are crucial considerations in statistical modeling. With existing methods, the heteroscedastic nature of data can introduce bias and compromise the validity of statistical inferences. Accounting for within-subject dependence is likewise essential for valid statistical inference. The proposed method is shown to have the required properties through simulations and real-world applications. These demonstrate our method’s superiority in terms of statistical power and control over false positive rates compared to conventional parametric methods in several settings. In addition, the proposed approach is able to handle several random effects without an explicit specification in the model. The researcher is not required to choose the random structure of the model (i.e., random intercept and random slope or related quantities), as happens when the GLM or other common “two-stage summary statistics” approaches proposed in the literature (e.g., Beckmann et al., Reference Beckmann, Jenkinson and Smith2003; Finos and Basso, Reference Finos and Basso2014; Frison and Pocock, Reference Frison and Pocock1992; Helwig, Reference Helwig2019; Mumford and Poldrack, Reference Mumford and Poldrack2007) are used. The only prerequisite is a grasp of the underlying structure of data clusters in order to properly estimate the summary measures at the subject level in the first stage. In contrast, specifying the random part of a GLMM is a complex process: no general solution is available for the problem of choosing between, e.g., a parsimonious (Bates, Kliegl, et al., Reference Bates, Kliegl, Vasishth and Baayen2015; Matuschek et al., Reference Matuschek, Kliegl, Vasishth, Baayen and Bates2017) and a maximal (Barr et al., Reference Barr, Levy, Scheepers and Tily2013) model. This choice, however, directly impacts the inferential conclusions. This happens in some way also using the GEE approach, since it assumes a working correlation that is barely known in practice, and choosing the identity (i.e., assuming independence between the repeated measurements) is a safe choice to have consistent estimators (Sullivan Pepe & Anderson, Reference Sullivan Pepe and Anderson1994).
Some limitations of the approach are to be discussed. The correctness of the method depends directly on the quality estimation of the subject-level summary measures in the first stage. An unbiased (at least asymptotically) estimator is, in fact, necessary to properly control the type I error and to have satisfactory power in the second stage. Nevertheless, this is beyond the scope of the article, and future research will be needed to compare the results with different estimates of the subject-level measure and find the most suitable one in terms of type I error and power. Here, we present two common estimators, the maximum likelihood estimator and its Firth correction (Firth, Reference Firth1995). In addition, this drawback allows some freedom for the researcher to utilize different approaches in the first stage that are capable of handling, for example, perfect separation problems in the logistic model and/or a low number of repeated measurements. Several works are, in fact, present in the literature (Cordeiro & McCullagh, Reference Cordeiro and McCullagh1991; Kenne Pagui et al., Reference Kenne Pagui, Salvan and Sartori2017; Kosmidis et al., Reference Kosmidis, Kenne Pagui and Sartori2020; Kosmidis & Firth, Reference Kosmidis and Firth2009, Reference Kosmidis and Firth2021) to correct the
$\mathcal {O}(n^{-1})$
bias where n stands for the sample size, i.e., repeated measurements within the cluster in our case. Another limitation includes handling more complex correlation structures such as crossed-random effects (Anderson & Braak, Reference Anderson and Braak2003). The cases described above and, in addition, the extension to the multivariate case (i.e., when we have more than one dependent variable) will be properly investigated in further works.
In conclusion, our research extends the applicability of permutation-based approaches in generalized linear mixed models, offering a flexible and reliable tool to overcome the limitations of traditional statistical parametric methods. With its ability to handle complex random effect structures and account for heteroscedasticity and within-subject dependence, our approach promises to advance statistical inference in psychometric research, neuroscience, and beyond. As such, it opens up new avenues for exploring intricate relationships and dependencies within diverse datasets, expanding the horizons of statistical analysis in various fields.
Author contribution
Angela Andreella: conceptualization, methodology, data curation, formal analysis, investigation, writing - original draft, writing - review & editing. Jelle Goeman: conceptualization, methodology, review & editing, supervision. Jesse Hemerik: conceptualization, methodology, investigation, review & editing, supervision. Livio Finos: conceptualization, methodology, data curation, investigation, writing - review & editing, supervision.
Funding statement
This work was supported by the Department of Economics and Management, University of Trento.
Competing interests
The authors declare that they have no competing interests.
Appendix
A Simulation results
Here, we present simulations in the case of a subject-specific nuisance covariate. The dependent variable
$\boldsymbol {y}$
is then simulated as a Bernoulli random variable with mean:

following the same notations of Equation (7). Here,
$X_{ij}$
and
$Z_{ij}$
are simulated from a bivariate normal distribution with mean
$0$
and correlation equals
$0.5$
. We have
$U_{j} \sim \mathcal {N}(0,.5)$
,
$D_{j} \sim \mathcal {N}(0,.5)$
and
$G_{j} \sim \mathcal {N}(0,.5)$
. We perform
$1000$
simulations for each scenario and, as Section 4, we impose
$\beta = 0$
for estimating the type I error and
$\beta = 2$
for analyzing the power of the approaches, while the value of the nuisance parameter
$\gamma $
is fixed at
$2$
.
Figure A1 represents the scenario when Equation (A.1) is simulated without the random effect with respect to the within-subject nuisance covariate
$Z_{ij}$
, i.e.,
$G_j = 0$
. We can note how GEE does not control the type I error, while the performance of GLMM strongly depends again on the correct specification of the random structure. Here, the “GLMM misspecified” refers to fitting a GLMM without specifying the random slope for
$X_{ij}$
while simulations follow Equation (A.1) with
$G_j = 0$
. The corrected specified GLMM returns
$37.73\%$
of the times a singularity warning and
$0.63\%$
of the times a non-convergence issue. In these cases, we consider a p-value equals
$1$
. Finally, Figure A2 shows the estimated power. Even if the flip2sss Firth method loses some power in the case of a small number of repeated measurements, the trend follows roughly the one of the correctly specified GLMM.

Figure A1 Estimated type I error considering
$N \in \{10,50,100\}$
number of subjects and
$n_j \in \{5,10,25,50\}$
repeated measurements. Each line represents one model, and the grey area around the dashed black line represents the
$0.95$
confidence bounds for
$\alpha = 0.05$
. The term “GLMM” refers to the GLMM with random intercept and random slope specified, whereas “misspecified GLMM” refers to the GLMM with solely the random intercept included in the model specification. The term “flip2sss” refers to the results using the classic GLM in the first stage, while the term “flip2sss Firth” refers to the results using the bias correction proposed by Firth (Reference Firth1995). Finally, the terms GLM and GEE refer to a GLM with a logit link function and a GEE with an independent working correlation matrix, respectively.

Figure A2 Estimated power considering
$N \in \{10,50,100\}$
number of subjects and
$n_j \in \{5,10,25,50\}$
repeated measurements. Each line represents one model with corresponding colored confidence bounds at level
$0.95$
. The term “flip2sss” refers to the results using the classic GLM in the first stage, while the term “flip2sss Firth” refers to the results using the bias correction proposed by Firth (Reference Firth1995).

Figure A3 Estimated type I error considering
$N \in \{10,50,100\}$
number of subjects and
$n_j \in \{5,10,25,50\}$
repeated measurements. Each line represents one model, and the grey area around the dashed black line represents the
$0.95$
confidence bounds for
$\alpha = 0.05$
. The term “misspecified GLMM 2” refers to the GLMM with random intercept and random slope for
$X_{ij}$
specified, whereas “misspecified GLMM” refers to the GLMM with solely the random intercept included in the model specification. The term “flip2sss” refers to the results using the classic GLM in the first stage, while the term “flip2sss Firth” refers to the results using the bias correction proposed by Firth (Reference Firth1995). Finally, the terms GLM and GEE refer to a GLM with a logit link function and a GEE with an independent working correlation matrix, respectively.

Figure A4 Estimated power considering
$N \in \{10,50,100\}$
number of subjects and
$n_j \in \{5,10,25,50\}$
repeated measurements. Each line represents one model with corresponding colored confidence bounds at level
$0.95$
. The term “flip2sss” refers to the results using the classic GLM in the first stage, while the term “flip2sss Firth” refers to the results using the bias correction proposed by Firth (Reference Firth1995).

Figure A5 Estimated type I error considering
$N \in \{10,50,100\}$
number of subjects and
$n_j \in \{5,10,25,50\}$
repeated measurements in the case of heteroscedasticity. Each line represents one model, and the grey area around the dashed black line represents the
$0.95$
confidence bounds for
$\alpha = 0.05$
. The “LMM” refers to the LMM specifying random intercept and random slope, while the “LMM misspecified” refers to the LMM with only random intercept. Finally, the terms LM and GEE refer to a normal linear model and a GEE with an independent working correlation matrix, respectively.

Figure A6 Zoom of Figure A5 considering the estimated type I error
$\in [0,0.1]$
. Estimated type I error considering
$N \in \{10,50,100\}$
number of subjects and
$n_j \in \{5,10,25,50\}$
repeated measurements in the case of heteroscedasticity. Each line represents one model, and the grey area around the dashed black line represents the
$0.95$
confidence bounds for
$\alpha = 0.05$
. The “LMM” refers to the LMM specifying random intercept and random slope, while the “LMM misspecified” refers to the LMM with only random intercept. Finally, the terms LM and GEE refer to a normal linear model and a GEE with an independent working correlation matrix, respectively.
Figure A3 shows the estimated type I error simulating
$\mathbb {E}(y_{ij})$
as Equation (A.1), i.e., with random slope for both
$X_{ij}$
and
$Z_{ij}$
. In this case, three GLMMs are fitted: (1) considering only random intercept named “GLMM misspecified”, (2) considering random intercept and random slope for
$X_{ij}$
(i.e., the random slope for
$Z_{ij}$
is not specified, named “GLMM misspecified 2”) and (3) considering random intercept and the two random slopes for
$X_{ij}$
and
$Z_{ij}$
(i.e., the correctly specified GLMM, named simply “GLMM”) in the model. We can note that GEE fails to control the type I error in any scenario, while again, the performance of GLMM depends on the specification of the random structure of the model. In contrast, the two approaches proposed, i.e., flip2sss with or without the Firth correction, control properly the type I error, maintaining a good power as shown in Figure A4.
Figure A5 shows the estimated type I error following the settings of Figure 1 but assuming
$y_{ij} \sim \mathcal {N}(\mu _{ij}, \sigma ^2_{ij})$
with
$\mu _{ij}= X_{ij} \beta + Z_j \gamma + U_j + D_j X_{ij}$
and
$\sigma ^2_{ij} = 4 X_{ij}^2$
. Figure A6 focuses on estimated error values ranging from
$0$
to
$0.1$
, providing a clearer comparison of the methods’ ability to control type I error. We can note how the flip2sss approach proposed is able to deal with heteroscedasticity and small sample size at the same time in contrast to GEE and Linear Mixed Model (LMM). Here, the lmerTest package (Kuznetsova, Reference Kuznetsova, Brockhoff and Christensen2017) is used to compute tests on lmer objects.
B Application
Table B1 represents the same results as Table 3 assuming only the random intercept instead of random intercept plus random slope (i.e., the modGLMM2 object defined in Code 2).
Table B1 GLMM results assuming only the random intercept: summary of the model and related analysis of variance table.

In Code 2, we report the R commands used to estimate the GLMMs and GEE.
