
Sparse and Simple Structure Estimation via Prenet Penalization

Published online by Cambridge University Press:  01 January 2025

Kei Hirose*
Affiliation:
Kyushu University; RIKEN Center for Advanced Intelligence Project
Yoshikazu Terada
Affiliation:
Osaka University; RIKEN Center for Advanced Intelligence Project
* Correspondence should be made to Kei Hirose, Institute of Mathematics for Industry, Kyushu University, Fukuoka, Japan. Email: hirose@imi.kyushu-u.ac.jp; URL: https://keihirose.com

Abstract

We propose a prenet (product-based elastic net), a novel penalization method for factor analysis models. The penalty is based on the product of a pair of elements in each row of the loading matrix. The prenet not only shrinks some of the factor loadings toward exactly zero but also enhances the simplicity of the loading matrix, which plays an important role in the interpretation of the common factors. In particular, with a large amount of prenet penalization, the estimated loading matrix possesses a perfect simple structure, which is known as a desirable structure in terms of the simplicity of the loading matrix. Furthermore, the perfect simple structure estimation via the proposed penalization turns out to be a generalization of the k-means clustering of variables. On the other hand, a mild amount of the penalization approximates a loading matrix estimated by the quartimin rotation, one of the most commonly used oblique rotation techniques. Simulation studies compare the performance of our proposed penalization with that of existing methods under a variety of settings. The usefulness of the perfect simple structure estimation via our proposed procedure is presented through various real data applications.

Type
Theory and Methods
Creative Commons
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Copyright
Copyright © 2022 The Author(s) under exclusive licence to The Psychometric Society

Factor analysis investigates the correlation structure of high-dimensional observed variables by the construction of a small number of latent variables called common factors. Factor analysis can be considered as a soft clustering of variables, in which each factor corresponds to a cluster and observed variables are categorized into overlapping clusters. For interpretation purposes, it is desirable for the observed variables to be well-clustered (Yamamoto and Jennrich, 2013) or for the loading matrix to be simple (Thurstone, 1947). In particular, the perfect simple structure (e.g., Bernaards and Jennrich, 2003; Jennrich, 2004), wherein each row of the loading matrix has at most one nonzero element, provides a non-overlapping clustering of variables in the sense that variables corresponding to nonzero elements of the jth column of the loading matrix belong to the jth cluster.
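As a concrete illustration, a perfect simple structure induces a hard clustering of variables: each variable is assigned to the column holding its single nonzero loading. A minimal sketch in Python (the loading values below are hypothetical, chosen only for illustration):

```python
import numpy as np

# A hypothetical 6 x 2 loading matrix with a perfect simple structure:
# each row has at most one nonzero element.
Lambda = np.array([
    [0.8, 0.0],
    [0.7, 0.0],
    [0.9, 0.0],
    [0.0, 0.6],
    [0.0, 0.8],
    [0.0, 0.7],
])

def cluster_labels(loadings):
    """Assign each variable to the factor (cluster) of its nonzero loading.

    Assumes a perfect simple structure, so the column of largest absolute
    loading in each row defines that variable's cluster.
    """
    return np.argmax(np.abs(loadings), axis=1)

labels = cluster_labels(Lambda)  # array([0, 0, 0, 1, 1, 1])
```

Here variables 1-3 form one cluster and variables 4-6 another, exactly the non-overlapping clustering described above.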

Conventionally, the well-clustered or simple structure of the loading matrix is found by rotation techniques. A number of rotation techniques have been proposed in the literature; for example, quartimin rotation (Carroll, 1953), varimax rotation (Kaiser, 1958), promax rotation (Hendrickson and White, 1964), simplimax rotation (Kiers, 1994), geomin rotation (Yates, 1987), and the component loss criterion (Jennrich, 2004, 2006). A review of rotation techniques is given in Browne (2001). The main purpose of factor rotation is to obtain a solution that is as simple as possible. See, e.g., Thurstone (1935), Carroll (1953), Neuhaus and Wrigley (1954), Kaiser (1974), and Bernaards and Jennrich (2003).

The problem with the rotation technique is that it cannot produce a sufficiently sparse solution in some cases (Hirose and Yamamoto, 2015), because the loading matrix must be found among the set of unpenalized maximum likelihood estimates. To obtain sparser solutions than factor rotation, we employ a penalization method. Penalization is a generalization of the rotation techniques and can produce sparser solutions than the rotation methods (Hirose and Yamamoto, 2015). Typically, many researchers use $L_1$-type penalization, such as the lasso (Tibshirani, 1996), the adaptive lasso (Zou, 2006), and the minimax concave penalty (MCP; Zhang, 2010); see, for example, Choi et al. (2011), Ning and Georgiou (2011), Srivastava et al. (2017), Hirose and Yamamoto (2015), Trendafilov et al. (2017), and Hui et al. (2018). The $L_1$ penalization shrinks some of the parameters toward exactly zero; in other words, parameters that need not be modeled are automatically disregarded. Furthermore, the degree of sparsity can be adjusted freely by changing the value of a regularization parameter. The $L_1$ penalization also provides good estimation accuracy, such as $L_1$ consistency and model selection consistency in high dimensions (e.g., Zhao and Yu, 2007; Wainwright, 2009; Bühlmann and van de Geer, 2011).

As described above, it is important in exploratory factor analysis to obtain a "good" loading matrix in the sense of simplicity and thus interpretability. Although the $L_1$ penalization achieves sparse estimation, the estimated loading matrix is not guaranteed to possess an interpretable structure. For example, a very large amount of penalization leads to a zero matrix, which is meaningless from an interpretation viewpoint. Thus, the $L_1$ penalization cannot produce an interpretable loading matrix with a sufficiently large regularization parameter. Even if a small value of the regularization parameter is selected, the $L_1$ penalization often cannot approximate a true loading matrix when that matrix is not sufficiently sparse: with the lasso, some factor loadings whose true values are close (but not very close) to zero are estimated as exactly zero, and this misspecification can have a significant negative effect on the estimation of the other factor loadings (Hirose and Yamamoto, 2014). Therefore, it is important to estimate a loading matrix that is not only sparse but also interpretable. To achieve this, we need a different type of penalty.

In this study, we propose the prenet (product-based elastic net) penalty, which is based on the product of pairs of parameters in each row of the loading matrix. A remarkable feature of the prenet is that a large amount of penalization leads to the perfect simple structure; the existing $L_1$-type penalization methods do not share this property. Furthermore, perfect simple structure estimation via the prenet penalty is shown to be a generalization of the k-means clustering of variables. On the other hand, with a mild amount of prenet penalization, the estimated loading matrix approximates that obtained by the quartimin rotation, a widely used oblique rotation method. The quartimin criterion can often estimate a non-sparse loading matrix appropriately, which addresses the problem of lasso-type penalization mentioned above. We employ the generalized expectation–maximization (EM) algorithm and the coordinate descent algorithm (e.g., Friedman et al., 2010) to obtain the estimator. The proposed algorithm monotonically decreases the objective function at each iteration.

In our proposed procedure, the regularization parameter controls the degree of simplicity: the larger the regularization parameter, the simpler the loading matrix. The advantage of our procedure is that we can change the degree of simplicity according to the purpose of the analysis. This study focuses on two purposes: (i) finding a loading matrix that fits the data and is also as simple as possible, and (ii) conducting cluster analysis by estimating a perfect simple structure. The regularization parameter selection procedure differs between these two purposes. For purpose (i), we select the regularization parameter by the Akaike information criterion (AIC; Akaike, 1973; Zou et al., 2007) or the Bayesian information criterion (BIC; Schwarz, 1978; Zou et al., 2007). Purpose (ii) is attained by letting the regularization parameter tend to infinity.

We conduct Monte Carlo simulations to compare the performance of our proposed method with that of $L_1$-type penalization and conventional rotation techniques. The simulations investigate performance in terms of both (i) and (ii); the investigations of (i) and (ii) are detailed in Sects. 5.2 and 5.3, respectively. Our proposed method is applied to big five personality traits data to study the performance for various sample sizes and the impact of the regularization parameter on accuracy. The analysis of the big five personality traits aims at purpose (i). We also present analyses of fMRI and electricity demand data, intended for both purposes (i) and (ii), in Section S2 of the supplemental material.

The rest of this article is organized as follows. Section 1 describes the estimation of the factor analysis model via penalization. In Sect. 2, we introduce the prenet penalty. Section 3 describes several properties of the prenet penalty, including its relationship with the quartimin criterion. Section 4 presents an estimation algorithm to obtain the prenet solutions. In Sect. 5, we conduct a Monte Carlo simulation to investigate the performance of the prenet penalization. Section 6 illustrates the usefulness of our proposed procedure through real data analyses. Extensions and future work are discussed in Sect. 7. Some technical proofs and details of our algorithm are given in the Appendix. The supplemental materials include further numerical and theoretical investigations, including analyses of resting-state fMRI and electricity demand data.

1. Estimation of Factor Analysis Model via Penalization

Let $\varvec{X} = (X_1,\dots,X_p)^T$ be a $p$-dimensional observed random vector with mean vector $\varvec{0}$ and variance–covariance matrix $\varvec{\Sigma}$. The factor analysis model is

(1) $$\varvec{X} = \varvec{\Lambda}\varvec{F} + \varvec{\varepsilon},$$

where $\varvec{\Lambda} = (\lambda_{ij})$ is a $p \times m$ loading matrix, $\varvec{F} = (F_1,\ldots,F_m)^T$ is a random vector of common factors, and $\varvec{\varepsilon} = (\varepsilon_1,\ldots,\varepsilon_p)^T$ is a random vector of unique factors. It is assumed that $E(\varvec{F}) = \varvec{0}$, $E(\varvec{\varepsilon}) = \varvec{0}$, $E(\varvec{F}\varvec{F}^T) = \varvec{\Phi}$, $E(\varvec{\varepsilon}\varvec{\varepsilon}^T) = \varvec{\Psi}$, and $E(\varvec{F}\varvec{\varepsilon}^T) = \varvec{O}$, where $\varvec{\Phi}$ is an $m \times m$ factor correlation matrix and $\varvec{\Psi}$ is a $p \times p$ diagonal matrix (i.e., a strict factor model). The diagonal elements of $\varvec{\Psi}$ are referred to as unique variances. Under these assumptions, the variance–covariance matrix of the observed random vector $\varvec{X}$ is $\varvec{\Sigma} = \varvec{\Lambda}\varvec{\Phi}\varvec{\Lambda}^T + \varvec{\Psi}$.
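The covariance identity $\varvec{\Sigma} = \varvec{\Lambda}\varvec{\Phi}\varvec{\Lambda}^T + \varvec{\Psi}$ is easy to check numerically; the parameter values below are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

p, m = 6, 2
# Hypothetical model parameters (illustrative values only).
Lambda = np.array([[0.8, 0.0], [0.7, 0.0], [0.9, 0.0],
                   [0.0, 0.6], [0.0, 0.8], [0.0, 0.7]])
Phi = np.array([[1.0, 0.3],
                [0.3, 1.0]])                    # factor correlation matrix
Psi = np.diag(rng.uniform(0.2, 0.5, size=p))    # diagonal unique variances

# Model-implied covariance: Sigma = Lambda Phi Lambda^T + Psi
Sigma = Lambda @ Phi @ Lambda.T + Psi
```

By construction, `Sigma` is symmetric and positive definite, as a variance–covariance matrix must be.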

In many cases, the orthogonal factor model (i.e., $\varvec{\Phi} = \varvec{I}_m$, where $\varvec{I}_m$ is the identity matrix of order $m$) is used, but it is often oversimplified. This paper covers the oblique factor model, which in many cases allows a more realistic estimation of latent factors than the orthogonal factor model (e.g., Fabrigar et al., 1999; Sass and Schmitt, 2010; Schmitt and Sass, 2011).

Let $\varvec{x}_1,\ldots,\varvec{x}_n$ be $n$ observations and $\varvec{S} = (s_{ij})$ be the corresponding sample covariance matrix. Let $\varvec{\theta} = (\mathrm{vec}(\varvec{\Lambda})^T, \mathrm{diag}(\varvec{\Psi})^T, \mathrm{vech}(\varvec{\Phi})^T)^T$ be the parameter vector, where $\mathrm{vech}(\cdot)$ stacks the lower triangular elements of a matrix, excluding the diagonal, into a vector. We estimate the model parameters by minimizing the penalized loss function $\ell_{\rho}(\varvec{\theta})$ expressed as

(2) $$\ell_{\rho}(\varvec{\theta}) = \ell(\varvec{\theta}) + \rho P(\varvec{\Lambda}),$$

where $\ell(\varvec{\theta})$ is a loss function, $P(\varvec{\Lambda})$ is a penalty function, and $\rho > 0$ is a regularization parameter. As a loss function, we adopt the maximum likelihood discrepancy function

(3) $$\ell_{\text{DF}}(\varvec{\theta}) = \frac{1}{2}\left\{\mathrm{tr}(\varvec{\Sigma}^{-1}\varvec{S}) - \log|\varvec{\Sigma}^{-1}\varvec{S}| - p\right\},$$

where DF is an abbreviation for discrepancy function. Assume that the observations $\varvec{x}_1,\ldots,\varvec{x}_n$ are drawn from the $p$-dimensional normal population $N_p(\varvec{0}, \varvec{\Sigma})$ with $\varvec{\Sigma} = \varvec{\Lambda}\varvec{\Phi}\varvec{\Lambda}^T + \varvec{\Psi}$. The minimizer of $\ell_{\text{DF}}(\varvec{\theta})$ is the maximum likelihood estimate. It can be shown that $\ell_{\text{DF}}(\varvec{\theta}) \ge 0$ for any $\varvec{\theta}$, and that $\ell_{\text{DF}}(\varvec{\theta}) = 0$ if and only if $\varvec{\Lambda}\varvec{\Phi}\varvec{\Lambda}^T + \varvec{\Psi} = \varvec{S}$, provided $\varvec{\Sigma}$ and $\varvec{S}$ are positive definite. It is worth noting that our proposed penalty, described in Sect. 2, can be directly applied to many other loss functions, including the quadratic loss used for generalized least squares (Jöreskog and Goldberger, 1971).
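A minimal sketch of the discrepancy function (3), assuming numpy and positive definite inputs, makes these two properties concrete:

```python
import numpy as np

def ml_discrepancy(Sigma, S):
    """Maximum likelihood discrepancy (Eq. 3):
    l_DF = 0.5 * { tr(Sigma^{-1} S) - log|Sigma^{-1} S| - p }.
    Sigma and S are assumed positive definite.
    """
    p = S.shape[0]
    A = np.linalg.solve(Sigma, S)        # Sigma^{-1} S without explicit inverse
    _, logdet = np.linalg.slogdet(A)     # log|Sigma^{-1} S|
    return 0.5 * (np.trace(A) - logdet - p)

# l_DF >= 0, with equality exactly when the model-implied Sigma equals S.
S = np.array([[2.0, 0.5],
              [0.5, 1.0]])
zero_at_S = ml_discrepancy(S, S)         # ~0: perfect fit
positive  = ml_discrepancy(np.eye(2), S) # > 0: misspecified Sigma
```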

When $\varvec{\Phi} = \varvec{I}_m$, the model has a rotational indeterminacy: $\varvec{\Lambda}$ and $\varvec{\Lambda}\varvec{T}$ generate the same covariance matrix $\varvec{\Sigma}$, where $\varvec{T}$ is an arbitrary orthogonal matrix. Thus, when $\rho = 0$, the solution that minimizes (2) is not uniquely determined. However, when $\rho > 0$, the solution may be uniquely determined, except for the sign and permutation of the columns of the loading matrix, when an appropriate penalty $P(\varvec{\Lambda})$ is chosen.
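The indeterminacy can be verified numerically: for any orthogonal $\varvec{T}$, the rotated loadings $\varvec{\Lambda}\varvec{T}$ yield the same $\varvec{\Sigma}$ when $\varvec{\Phi} = \varvec{I}_m$. A small check with a hypothetical random loading matrix:

```python
import numpy as np

rng = np.random.default_rng(1)

p, m = 5, 2
Lambda = rng.normal(size=(p, m))               # hypothetical loadings
Psi = np.diag(rng.uniform(0.2, 0.5, size=p))   # hypothetical unique variances

# With Phi = I_m, any orthogonal T leaves Sigma unchanged:
# (Lambda T)(Lambda T)^T + Psi = Lambda (T T^T) Lambda^T + Psi
theta = 0.7  # arbitrary rotation angle
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

Sigma1 = Lambda @ Lambda.T + Psi
Sigma2 = (Lambda @ T) @ (Lambda @ T).T + Psi
```

Both matrices coincide to machine precision, so the unpenalized loss cannot distinguish $\varvec{\Lambda}$ from $\varvec{\Lambda}\varvec{T}$.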

2. Prenet Penalty

2.1. Definition

We propose the prenet penalty

(4) $$P(\varvec{\Lambda}) = \sum_{i=1}^p \sum_{j=1}^{m-1} \sum_{k > j} \left\{\gamma|\lambda_{ij}\lambda_{ik}| + \frac{1}{2}(1-\gamma)(\lambda_{ij}\lambda_{ik})^2\right\},$$

where $\gamma \in (0,1]$ is a tuning parameter. The most significant feature of the prenet penalty is that it is based on the product of pairs of parameters.

As $\gamma \rightarrow 0$, the prenet penalty becomes equivalent to the quartimin criterion (Carroll, 1953), a widely used oblique rotation criterion in factor rotation. As with the quartimin rotation, the prenet penalty in (4) eliminates the rotational indeterminacy, except for the sign and permutation of the columns of the loading matrix, and contributes significantly to the estimation of a simple loading matrix. Because the prenet penalty includes products of absolute values of factor loadings, it produces factor loadings that are exactly zero.
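A direct implementation of the penalty (4) makes its key property visible: the penalty vanishes on any perfect simple structure, since every within-row product of loadings is zero. The loading matrices below are hypothetical:

```python
import numpy as np
from itertools import combinations

def prenet(Lambda, gamma):
    """Prenet penalty (Eq. 4): sum over rows i and column pairs j < k of
    gamma * |l_ij l_ik| + 0.5 * (1 - gamma) * (l_ij l_ik)^2."""
    total = 0.0
    p, m = Lambda.shape
    for i in range(p):
        for j, k in combinations(range(m), 2):
            prod = Lambda[i, j] * Lambda[i, k]
            total += gamma * abs(prod) + 0.5 * (1 - gamma) * prod ** 2
    return total

# Zero on a perfect simple structure (at most one nonzero per row),
# positive as soon as any row loads on two factors.
perfect = np.array([[0.8, 0.0], [0.0, 0.6], [0.0, 0.9]])
dense   = np.array([[0.8, 0.5], [0.4, 0.6], [0.3, 0.9]])
```

This holds for every $\gamma \in (0,1]$, which is why a large amount of prenet penalization can drive the estimate toward a perfect simple structure rather than toward the zero matrix.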

2.2. Comparison with the Elastic Net Penalty

The prenet penalty is similar to the elastic net penalty (Zou and Hastie, 2005)

(5) $$P(\varvec{\Lambda }) = \sum _{i = 1}^p \sum _{j = 1}^{m} \left\{ \gamma |\lambda _{ij}| + \frac{1}{2} (1-\gamma ) \lambda _{ij}^2 \right\} ,$$

which is a hybrid of the lasso and ridge penalties. Although the two penalties look similar, there is a fundamental difference between them: the elastic net is constructed from the sum of the elements of the parameter vector, whereas the prenet is based on the product of a pair of parameters.

Figure 1 shows the penalty functions of the prenet ($P(x,y) = \gamma |xy| + (1-\gamma )(xy)^2/2$) and the elastic net ($P(x,y) = \gamma (|x|+|y|) + (1-\gamma )(x^2+y^2)/2$) for $\gamma = 0.1$, 0.5, 0.9. Clearly, the prenet penalty is a nonconvex function. A significant difference between the prenet and the elastic net is that the prenet penalty becomes zero when either $x$ or $y$ attains zero, whereas the elastic net penalty becomes zero only when both $x = 0$ and $y = 0$. Therefore, for a two-factor model, the estimate of either $\lambda _{i1}$ or $\lambda _{i2}$ can be exactly zero under the prenet penalization, leading to a perfect simple structure. In contrast, the elastic net tends to produce estimates in which both $|\lambda _{i1}|$ and $|\lambda _{i2}|$ are small.
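This contrast is easy to check numerically. The sketch below (function names are ours) evaluates both bivariate penalty functions on a point where one coordinate is zero: the prenet vanishes there, while the elastic net does not.

```python
def prenet(x, y, gamma):
    # Bivariate prenet penalty: depends only on the product x*y.
    return gamma * abs(x * y) + 0.5 * (1 - gamma) * (x * y) ** 2

def elastic_net(x, y, gamma):
    # Bivariate elastic net penalty: depends on each coordinate separately.
    return gamma * (abs(x) + abs(y)) + 0.5 * (1 - gamma) * (x ** 2 + y ** 2)

gamma = 0.5
# On the axes (one coordinate zero), the prenet vanishes but the elastic net does not:
print(prenet(0.8, 0.0, gamma))       # 0.0
print(elastic_net(0.8, 0.0, gamma))  # 0.56
```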

The penalty functions in Fig. 1 also show that the prenet penalty becomes smoother as $\gamma $ decreases. Thus, the value of $\gamma $ controls the degree of sparsity: the larger the value of $\gamma $, the sparser the estimate of the loading matrix. With an appropriate value of $\gamma $, the prenet penalty enhances both the simplicity and the sparsity of the loading matrix. Further investigation of the estimator as a function of $\gamma $ is presented in Sect. 5.

Figure 1. Penalty functions of the prenet and the elastic net with various $\gamma $.

3. Properties of the Prenet Penalty

3.1. Perfect Simple Structure

The model (1) does not impose an orthogonality constraint on the loading matrix $\varvec{\Lambda }$. For unconstrained $\varvec{\Lambda }$, most existing penalties, such as the lasso, shrink all coefficients toward zero when the tuning parameter $\rho $ is sufficiently large; we usually obtain $\hat{\varvec{\Lambda }} = \varvec{O}$ as $\rho \rightarrow \infty $. However, the following proposition shows that the prenet penalty does not necessarily shrink all of the elements toward zero even when $\rho $ is sufficiently large.

Proposition 1

Assume that we use the prenet penalty with $\gamma \in (0,1]$. As $\rho \rightarrow \infty $, the estimated loading matrix possesses the perfect simple structure, that is, each row has at most one nonzero element.

Proof

As $\rho \rightarrow \infty $, $P(\hat{\varvec{\Lambda }})$ must satisfy $P(\hat{\varvec{\Lambda }}) \rightarrow 0$; otherwise, the second term of (2) diverges. When $P(\hat{\varvec{\Lambda }}) = 0$, we have ${\hat{\lambda }}_{ij}{\hat{\lambda }}_{ik} = 0$ for any $j \ne k$. Therefore, the $i$th row of $\hat{\varvec{\Lambda }}$ has at most one nonzero element. $\square $

The perfect simple structure is known to be a desirable property in the factor analysis literature because it makes the estimated loading matrix easy to interpret (e.g., Bernaards and Jennrich, 2003). When $\rho $ is small, the estimated loading matrix can be away from the perfect simple structure, but the goodness of fit of the model is improved.
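Whether a given estimate has reached the perfect simple structure of Proposition 1 can be verified mechanically. A minimal sketch (the function name and tolerance are ours): each row may have at most one entry whose magnitude exceeds a numerical threshold.

```python
import numpy as np

def is_perfect_simple(Lambda, tol=1e-10):
    """True if each row of the loading matrix has at most one
    nonzero element (up to the tolerance tol)."""
    Lambda = np.asarray(Lambda, dtype=float)
    nonzeros_per_row = np.sum(np.abs(Lambda) > tol, axis=1)
    return bool(np.all(nonzeros_per_row <= 1))

print(is_perfect_simple([[0.8, 0.0], [0.0, 0.9]]))  # True
print(is_perfect_simple([[0.8, 0.3], [0.0, 0.9]]))  # False
```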

Remark 1

For $\rho \rightarrow \infty $, the consistency of the loading matrix can be shown when the true loading matrix possesses the perfect simple structure. For simplicity, we consider the orthogonal case. Assume that $\varvec{\Sigma } = \varvec{\Lambda }\varvec{\Lambda }^T + \varvec{\Psi }$, where the true $\varvec{\Lambda }$ possesses the perfect simple structure. As $n \rightarrow \infty $, the sample covariance matrix converges to the true covariance matrix almost surely; thus, the loss function (3) is minimized when $\varvec{\Sigma } = \hat{\varvec{\Lambda }}\hat{\varvec{\Lambda }}^T + \hat{\varvec{\Psi }}$. When $\rho \rightarrow \infty $, $\hat{\varvec{\Lambda }}$ must possess the perfect simple structure. Therefore, we should consider the following problem:

$$\text{Find } (\hat{\varvec{\Lambda }}, \hat{\varvec{\Psi }}) \text{ that satisfies } \left\{ \begin{array}{l} \varvec{\Sigma } = \hat{\varvec{\Lambda }}\hat{\varvec{\Lambda }}^T + \hat{\varvec{\Psi }}, \\ \hat{\varvec{\Lambda }} \text{ has the perfect simple structure.} \end{array} \right.$$

The solution to the above problem is $\hat{\varvec{\Lambda }} = \varvec{\Lambda }$ except for the sign and permutation of the columns of the loading matrix if an identifiability condition for an orthogonal model (e.g., Theorem 5.1 in Anderson and Rubin, 1956) is satisfied.

3.2. Relationship with k-Means Clustering of Variables

The perfect simple structure corresponds to a clustering of variables: the variables with nonzero elements in the $j$th column of the loading matrix belong to the $j$th cluster. In this subsection, we investigate the relationship between the prenet with $\rho \rightarrow \infty $ and the k-means clustering of variables, one of the most popular cluster analysis methods.

Let $\varvec{X}_0$ be an $n \times p$ data matrix, written as $\varvec{X}_0 = (\varvec{x}_1^*,\dots ,\varvec{x}_p^*)$, where $\varvec{x}_i^*$ is the $i$th column vector of $\varvec{X}_0$. We consider clustering the variables $\varvec{x}_1^*,\dots ,\varvec{x}_p^*$ by k-means. Let $C_j$ $(j = 1,\dots ,m)$ be the set of indices of the variables that belong to the $j$th cluster. The objective function of k-means is

(6) $$\sum _{j = 1}^m \sum _{i \in C_j}\Vert \varvec{x}_i^* - \varvec{\mu }_j\Vert ^2 = (n-1) \left( \sum _{i = 1}^p s_{ii} - \sum _{j=1}^m \frac{1}{p_j}\sum _{i \in C_j}\sum _{i' \in C_j}s_{ii'} \right) ,$$

where $p_j = \#\{C_j\}$, $\varvec{\mu }_j = \frac{1}{p_j}\sum _{i \in C_j} \varvec{x}_i^*$, and recall that $s_{ii'} = \varvec{x}_i^{*T}\varvec{x}_{i'}^*/(n-1)$. Let $\varvec{\Lambda } = (\lambda _{ij})$ be a $p \times m$ indicator matrix with

(7) $$\lambda _{ij} = \left\{ \begin{array}{ll} 1/\sqrt{p_j} &{} i \in C_j, \\ 0 &{} i \notin C_j. \end{array} \right.$$

Using the fact that $\varvec{\Lambda }^T\varvec{\Lambda } = \varvec{I}_m$, the k-means clustering of variables using (6) is equivalent to (Ding et al., 2005)

(8) $$\min _{\varvec{\Lambda }} \Vert \varvec{S} - \varvec{\Lambda }\varvec{\Lambda }^T\Vert _{F}^2, \ \text{ subject to } (7),$$

where $\Vert \cdot \Vert _F$ denotes the Frobenius norm. We consider slightly modifying the condition on $\varvec{\Lambda }$ in (7) to

(9) $$\lambda _{ij} \lambda _{ik} = 0 \ (j \ne k) \ \text{ and } \ \varvec{\Lambda }^T\varvec{\Lambda } = \varvec{I}_m.$$

The modified k-means problem is then given as

(10) $$\min _{\varvec{\Lambda }} \Vert \varvec{S} - \varvec{\Lambda }\varvec{\Lambda }^T\Vert _{F}^2 \ \text{ subject to } (9).$$

The condition (9) is milder than (7): if $\varvec{\Lambda }$ satisfies (7), then (9) holds. The reverse does not always hold; under (9), the nonzero elements of each column need not be equal. Therefore, the modified k-means in (10) may capture a more complex structure than the original k-means.
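The algebraic identity in (6), which links the within-cluster sum of squares to the sample covariances $s_{ii'}$, can be checked numerically. A minimal sketch (the data, partition, and variable names are ours; columns are centered so that $s_{ii'}$ matches the inner products used on the left-hand side):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 6
X = rng.standard_normal((n, p))
Xc = X - X.mean(axis=0)          # center the columns
S = np.cov(X, rowvar=False)      # s_{ii'} = x_i^{*T} x_{i'}^* / (n-1)
clusters = [[0, 1, 2], [3, 4, 5]]  # an arbitrary partition C_1, C_2

# Left-hand side of (6): within-cluster sum of squared distances to the centroid.
lhs = sum(np.sum((Xc[:, C] - Xc[:, C].mean(axis=1, keepdims=True)) ** 2)
          for C in clusters)

# Right-hand side of (6), expressed via the sample covariance matrix.
rhs = (n - 1) * (np.trace(S)
                 - sum(S[np.ix_(C, C)].sum() / len(C) for C in clusters))

print(np.isclose(lhs, rhs))  # True
```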

Proposition 2

Assume that $\varvec{\Psi } = \alpha \varvec{I}_p$, $\varvec{\Phi }=\varvec{I}_m$, and $\alpha $ is given. Suppose that $\varvec{\Lambda }$ satisfies $\varvec{\Lambda }^T\varvec{\Lambda } = \varvec{I}_m$. The prenet solution with $\rho \rightarrow \infty $ is then obtained by (10).

Proof

The proof appears in Appendix A.1. $\square $

Proposition 2 shows that the prenet solution with $\rho \rightarrow \infty $ is a generalization of the problem (10). As mentioned above, the problem (10) is itself a generalization of the k-means problem in (8). Therefore, the perfect simple structure estimation via the prenet is a generalization of the k-means clustering of variables. We remark that the condition $\varvec{\Psi } = \alpha \varvec{I}_p$ in Proposition 2 corresponds to probabilistic principal component analysis (probabilistic PCA; Tipping and Bishop, 1999); hence, the penalized probabilistic PCA via the prenet is also a generalization of the k-means clustering of variables.

3.3. Relationship with Quartimin Rotation

As described in Sect. 2, the prenet penalty is a generalization of the quartimin criterion (Carroll, 1953); letting $\gamma \rightarrow 0$ in the prenet penalty (4) yields, up to a multiplicative constant, the quartimin criterion

$$P_{\mathrm{qmin}}(\varvec{\Lambda }) = \sum _{i = 1}^p \sum _{j = 1}^{m-1} \sum _{k > j} (\lambda _{ij}\lambda _{ik})^2.$$
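This limiting relationship can be verified numerically. In the sketch below (function names are ours), evaluating the prenet formula at $\gamma = 0$ (formally outside the tuning range $(0,1]$, but the formula extends continuously) gives exactly half the quartimin criterion; the constant factor does not affect the minimizers.

```python
import numpy as np

def prenet(Lambda, gamma):
    """Prenet penalty of Eq. (4)."""
    L = np.asarray(Lambda, dtype=float)
    m = L.shape[1]
    total = 0.0
    for j in range(m - 1):
        for k in range(j + 1, m):
            prod = L[:, j] * L[:, k]
            total += np.sum(gamma * np.abs(prod)
                            + 0.5 * (1 - gamma) * prod ** 2)
    return total

def quartimin(Lambda):
    """Quartimin criterion: sum of squared within-row products."""
    L = np.asarray(Lambda, dtype=float)
    m = L.shape[1]
    return sum(np.sum((L[:, j] * L[:, k]) ** 2)
               for j in range(m - 1) for k in range(j + 1, m))

rng = np.random.default_rng(1)
L = rng.standard_normal((5, 3))
print(np.isclose(prenet(L, gamma=0.0), 0.5 * quartimin(L)))  # True
```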

The quartimin criterion is typically used in factor rotation. The solution of the quartimin rotation method, say $\hat{\varvec{\theta }}_q$, is obtained by a two-step procedure. First, we calculate an unpenalized estimate, denoted by $\hat{\varvec{\theta }}$, that satisfies $\ell (\hat{\varvec{\theta }})=\min _{\varvec{\theta }} \ell (\varvec{\theta })$; this estimate is not unique due to the rotational indeterminacy. The second step minimizes the quartimin criterion over the restricted parameter space $\{ \varvec{\theta } \mid \ell (\varvec{\theta }) = \min _{\varvec{\theta }}\ell (\varvec{\theta }) \}$. Hirose and Yamamoto (2015) showed that the solution of the quartimin rotation, $\hat{\varvec{\theta }}_q$, can be obtained by

(11) $$\min _{\varvec{\theta }}P_{\mathrm{qmin}}(\varvec{\Lambda }), \ \text{ subject to } \ \ell (\varvec{\theta }) = \ell (\hat{\varvec{\theta }}),$$

under the condition that the unpenalized estimate of the loading matrix $\hat{\varvec{\Lambda }}$ is unique once the rotational indeterminacy in $\hat{\varvec{\Lambda }}$ is excluded. It is not easy to check this condition, but several necessary conditions for the identifiability of the orthogonal model are available (e.g., Theorem 5.1 in Anderson and Rubin, 1956).

Now, we present a basic asymptotic result for the prenet solution, from which we can see that the prenet solution is a generalization of the quartimin rotation. Let $(\Theta, d)$ be a compact parameter space with distance $d$, and let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. Suppose that for any $(\mathrm{vec}(\varvec{\Lambda})^T, \mathrm{diag}(\varvec{\Psi})^T, \mathrm{vech}(\varvec{\Phi})^T)^T \in \Theta$ and any $\varvec{T} \in \mathcal{O}(m)$, we have $(\mathrm{vec}(\varvec{\Lambda}\varvec{T})^T, \mathrm{diag}(\varvec{\Psi})^T, \mathrm{vech}(\varvec{\Phi})^T)^T \in \Theta$, where $\mathcal{O}(m)$ is the set of $m \times m$ orthonormal matrices. Let $\varvec{X}_1, \dots, \varvec{X}_n$ denote independent $\mathbb{R}^p$-valued random variables with common population distribution $\mathbb{P}$.
We further require that the empirical loss function and the true loss function can be written as $\ell(\varvec{\theta}) = \sum_{i=1}^n q(\varvec{X}_i; \varvec{\theta})/n$ and $\ell_*(\varvec{\theta}) = \int q(\varvec{x}; \varvec{\theta})\, \mathbb{P}(d\varvec{x})$, respectively. Let $\hat{\varvec{\theta}}_\rho$ denote an arbitrary measurable prenet estimator satisfying $\ell(\hat{\varvec{\theta}}_\rho) + \rho P(\hat{\varvec{\Lambda}}_\rho) = \min_{\varvec{\theta} \in \Theta} \ell(\varvec{\theta}) + \rho P(\varvec{\Lambda})$.

Condition 3.1

$q(\varvec{x}; \varvec{\theta})$ fulfills the following conditions:

  • For each $\varvec{x} \in \mathbb{R}^p$, the function $q(\varvec{x}; \varvec{\theta})$ is continuous on $\Theta$.

  • There exists a $\mathbb{P}$-integrable function $g(\varvec{x})$ such that $|q(\varvec{x}; \varvec{\theta})| \le g(\varvec{x})$ for all $\varvec{x} \in \mathbb{R}^p$ and all $\varvec{\theta} \in \Theta$.

Since $\ell(\varvec{\theta})$ is the discrepancy function in Eq. (3), $q(\varvec{x}; \varvec{\theta})$ is the logarithm of the density function of a normal distribution; in this case, Condition 3.1 is satisfied. The following proposition shows that the prenet estimator converges almost surely to a true parameter that minimizes the quartimin criterion when $\rho \rightarrow 0$ as $n \rightarrow \infty$.

Proposition 3

Assume that Condition 3.1 is satisfied. Denote by $\Theta_q^*$ the set of true solutions of the following quartimin problem:

$$\min_{\varvec{\theta} \in \Theta} P_{\mathrm{qmin}}(\varvec{\Lambda}) \quad \text{subject to} \quad \ell_*(\varvec{\theta}) = \min_{\varvec{\theta} \in \Theta} \ell_*(\varvec{\theta}).$$

Let $\rho_n$ ($n = 1, 2, \dots$) be a sequence that satisfies $\rho_n > 0$ and $\lim_{n \rightarrow \infty} \rho_n = 0$. Let $\hat{\varvec{\theta}}_{\rho_n}$ be the prenet solution with $\gamma \rightarrow 0$ and $\rho = \rho_n$. Then we obtain

$$\lim_{n \rightarrow \infty} d(\hat{\varvec{\theta}}_{\rho_n}, \Theta_q^*) = 0 \quad a.s.,$$

where $d(a, B) = \inf_{b \in B} d(a, b)$.

Proof

The proof is given in Appendix A.2. $\square$

Remark 3.1

Proposition 3 uses a set of true solutions $\Theta_q^*$ instead of a single true solution $\varvec{\theta}_q^*$. This is because even if the quartimin solution has no rotational indeterminacy, it still has an indeterminacy with respect to the sign and permutation of the columns of the loading matrix.

Remark 3.2

The geomin criterion (Yates, 1987) often produces a loading matrix similar to that obtained by the quartimin criterion (Asparouhov and Muthén, 2009). With the geomin criterion, a small number must be added to the loadings to address an identifiability problem (Hattori et al., 2017); the prenet does not suffer from such a problem. A detailed discussion is given in Section S3 of the supplemental material.

4. Algorithm

It is well known that solutions estimated by lasso-type penalization methods usually cannot be expressed in closed form because the penalty term includes a nondifferentiable function. As the objective function of the prenet is nonconvex, it is not easy to construct an efficient algorithm that attains a global minimum. Here, we use a generalized EM algorithm in which the latent factors are treated as missing values. The complete-data log-likelihood function is increased using the coordinate descent algorithm (Friedman et al., 2010), which is commonly used for lasso-type penalization. Although our proposed algorithm is not guaranteed to attain the global minimum, it decreases the objective function at each step. The update equations of the algorithm and its complexity are presented in Appendix B.
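Coordinate descent for nondifferentiable lasso-type penalties is built on the soft-thresholding operator. The exact prenet update is given in Appendix B; the generic operator itself can be sketched as follows (a standard definition, not specific to the prenet):

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0):
    shrinks z toward zero by t and sets it to exactly zero when |z| <= t.
    This is the closed-form coordinate update for an l1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

soft_threshold(np.array([-3.0, 0.5, 2.0]), 1.0)  # -> array([-2., 0., 1.])
```

The exact-zero region of this operator is what produces sparse loadings in each coordinate update.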

4.1. Efficient Algorithm for Sufficiently Large $\rho$

The prenet objective tends to be multimodal for large $\rho$, as is the case with the k-means algorithm. Therefore, we prepare many initial values, estimate a solution for each initial value, and select the solution that minimizes the penalized loss function. This would seem to require a heavy computational load; however, we can construct an efficient algorithm for sufficiently large $\rho$.

For sufficiently large $\rho$, the $i$th row of the loading matrix $\varvec{\Lambda}$ has at most one nonzero element, denoted by $\lambda_{ij}$. With the EM algorithm, we can easily find the location of this nonzero parameter given the current value of the parameter. Assume that the $(i,j)$th element of the loading matrix is nonzero and the $(i,k)$th elements ($k \ne j$) are zero. Because the penalty function attains zero at such a solution, it is sufficient to minimize the following function:

(12)$$f(\lambda_{ij}) = \varvec{\lambda}_i^T \varvec{A} \varvec{\lambda}_i - 2\varvec{\lambda}_i^T \varvec{b}_i = a_{jj}\lambda_{ij}^2 - 2\lambda_{ij} b_{ij}.$$

The minimizer is easily obtained as

(13)$$\hat{\lambda}_{ij} = b_{ij}/a_{jj}.$$

Substituting (13) into (12) gives $f(\hat{\lambda}_{ij}) = -b_{ij}^2/a_{jj}$. Therefore, the index $j$ that minimizes $f(\lambda_{ij})$ is

$$j = \mathrm{argmax}_k \frac{b_{ik}^2}{a_{kk}},$$

and $\varvec{\lambda}_i$ is updated as $\hat{\lambda}_{ij} = b_{ij}/a_{jj}$ and $\hat{\lambda}_{ik} = 0$ for any $k \ne j$.
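The row update above can be sketched in a few lines, assuming the quantities $\varvec{A}$ and $\varvec{b}_i$ in Eq. (12) are available from the current EM step (the function name is ours):

```python
import numpy as np

def update_row_perfect_simple(A, b_i):
    """Row update for sufficiently large rho: the single nonzero loading
    is placed at j = argmax_k b_ik^2 / a_kk with value b_ij / a_jj.
    A is the (m x m) matrix and b_i the length-m vector of Eq. (12)."""
    j = int(np.argmax(b_i ** 2 / np.diag(A)))  # index of the nonzero loading
    lam = np.zeros(len(b_i))
    lam[j] = b_i[j] / A[j, j]
    return lam

# Example: b^2 / diag(A) = [0.5, 2.25, 1.0], so index 1 wins with value 1.5/1.0.
update_row_perfect_simple(np.diag([2.0, 1.0, 4.0]), np.array([1.0, 1.5, 2.0]))
```

Applying this update to each row independently yields the perfect-simple-structure solution, which is how the procedure connects to a k-means-style assignment of variables to factors.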

4.2. Selection of the Maximum Value of $\rho$

The value of $\rho_{\max}$, the minimum value of $\rho$ that produces the perfect simple structure, is easily obtained using the $\hat{\varvec{\Lambda}}$ given by (13). Assume that $\hat{\lambda}_{ij} \ne 0$ and $\hat{\lambda}_{ik} = 0$ ($k \ne j$). Using the update equation for $\lambda_{ik}$ in (A.10) and the soft-thresholding function in (A.12) of Appendix A, we can show that the regularization parameter $\rho$ must satisfy the following inequality for $\lambda_{ik}$ to be estimated as zero:

$$\left| \frac{b_{ik} - a_{kj}\hat{\lambda}_{ij}}{a_{kk} + \rho\psi_i(1-\gamma)\hat{\lambda}_{ij}^2} \right| \le \frac{\psi_i}{a_{kk} + \rho\psi_i(1-\gamma)\hat{\lambda}_{ij}^2}\, \rho\gamma |\hat{\lambda}_{ij}|.$$

Thus, the value of $\rho_{\max}$ is

$$\rho_{\max} = \max_{i} \max_{k \in C_i} \frac{|b_{ik} - a_{kj}\hat{\lambda}_{ij}|}{\gamma \psi_i |\hat{\lambda}_{ij}|},$$

where $C_i = \{k \mid k \ne j, \hat{\lambda}_{ij} \ne 0\}$.
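Given the perfect-simple-structure estimate from (13), $\rho_{\max}$ can be computed directly. A sketch under our own conventions for stacking the vectors $\varvec{b}_i$ as the rows of a matrix:

```python
import numpy as np

def rho_max(A, B, Lam, psi, gamma):
    """Smallest rho producing the perfect simple structure.
    Lam: (p x m) loading matrix with one nonzero per row, from Eq. (13);
    B: (p x m) matrix stacking the vectors b_i as rows (our layout);
    A: (m x m); psi: length-p unique variances; gamma: prenet mixing weight."""
    p, m = Lam.shape
    val = 0.0
    for i in range(p):
        j = int(np.argmax(np.abs(Lam[i])))   # column of the nonzero loading
        lam_ij = Lam[i, j]
        if lam_ij == 0.0:                    # all-zero row imposes no constraint
            continue
        for k in range(m):
            if k != j:
                val = max(val, abs(B[i, k] - A[k, j] * lam_ij)
                               / (gamma * psi[i] * abs(lam_ij)))
    return val
```

For a single variable with $\hat{\lambda}_{i1} = 2$, $b_{i2} = 0.8$, $a_{21} = 0.3$, $\psi_i = 1$, and $\gamma = 0.5$, this returns $|0.8 - 0.6|/(0.5 \cdot 1 \cdot 2) = 0.2$, matching the displayed formula term by term.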

4.3. Estimation of the Entire Path of Solutions

The entire path of solutions can be produced on a grid of increasing values $\{\rho_1, \dots, \rho_K\}$. Here, $\rho_K$ is the value $\rho_{\max}$ obtained in Sect. 4.2, and $\rho_1 = \rho_K \Delta \sqrt{\gamma}$, where $\Delta$ is a small value such as 0.001. The factor $\sqrt{\gamma}$ allows us to estimate a variety of models even when $\gamma$ is small.
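A grid with these endpoints can be built, for instance, on a logarithmic scale; the log spacing and the default grid size below are our own choices, as the text fixes only the endpoints:

```python
import numpy as np

def rho_grid(rho_K, gamma, K=100, delta=1e-3):
    """Increasing grid {rho_1, ..., rho_K} with rho_1 = rho_K * delta * sqrt(gamma).
    Logarithmic spacing keeps small rho values well resolved."""
    rho_1 = rho_K * delta * np.sqrt(gamma)
    return np.logspace(np.log10(rho_1), np.log10(rho_K), num=K)

rho_grid(10.0, 0.25, K=5)  # grid from 10 * 0.001 * 0.5 = 0.005 up to 10.0
```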

The entire solution path can be constructed using a decreasing sequence $\{\rho_K, \dots, \rho_1\}$, starting with $\rho_K$. The proposed algorithm at $\rho_K$ does not always converge to the global minimum, so we prepare many initial values, estimate a solution for each initial value using the efficient algorithm described in Sect. 4.1, and select the solution that minimizes the penalized log-likelihood function. We then use a warm start: the solution at $\rho_{k-1}$ is computed using the solution at $\rho_k$ as the initial value. The warm start leads to improved and smoother objective value surfaces (Mazumder et al., 2011).

One may instead use a warm start with increasing $\rho$; that is, the solution with $\rho = 0$ is obtained by the rotation technique applied to the MLE, and $\rho$ is then gradually increased, using the solution from the previous step. However, the decreasing sequence of $\rho$ has a significant advantage over the increasing sequence: it allows application to the $n < p$ case. With an increasing order, the solution with $\rho = 0$ (the MLE) is not available when $n < p$, and the entire solution path cannot be produced. Therefore, we adopt the warm start with a decreasing sequence of $\rho$ rather than an increasing sequence.

Another method to estimate the entire solution path is to use the cold start with multiple random starts. Although the cold start does not always produce an estimate that is smooth as a function of $\rho$, it can sometimes find a better solution than the warm start when the loss function has multiple local minima. However, the cold start often requires a heavier computational load than the warm start.

When $\rho$ is extremely small, the loss function becomes nearly flat due to rotational indeterminacy. However, in our experience, our proposed algorithm generally produces a smooth and stable estimate when the warm start is adopted. Even with the cold start, the estimate is often stable for large sample sizes when $\rho$ is small but not extremely so, such as $\rho = 10^{-4}$. However, when $n < p$, the maximum likelihood estimate cannot be obtained; therefore, the cold start often produces an unstable estimate for small $\rho$.

4.4. Selection of the Regularization Parameter $\rho$

The estimate of the loading matrix depends on the regularization parameter $\rho$. As described in the Introduction, this study focuses on two different purposes of the analysis: (i) exploratory factor analysis and (ii) clustering of variables. When the purpose of the analysis is (ii), we simply set $\rho \rightarrow \infty$ to achieve the perfect simple structure estimation. When the purpose of the analysis is (i), $\rho$ is selected by the AIC or the BIC (Zou et al., 2007):

$$\text{AIC} = -2n\ell(\hat{\boldsymbol{\theta}}) + 2p_0, \qquad \text{BIC} = -2n\ell(\hat{\boldsymbol{\theta}}) + p_0 \log n,$$

where $p_0$ is the number of nonzero parameters.
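
Under the reading that $\ell(\hat{\boldsymbol{\theta}})$ denotes the per-observation log-likelihood (so that $n\ell(\hat{\boldsymbol{\theta}})$ is the total log-likelihood), the two criteria can be computed as in the following illustrative sketch:

```python
import numpy as np

def aic_bic(loglik_bar, n, theta_hat, tol=0.0):
    """AIC and BIC as defined above.

    loglik_bar : per-observation log-likelihood l(theta_hat) (assumption:
                 the criteria multiply it by n, matching the formulas).
    theta_hat  : flat array of estimated parameters; p0 counts entries
                 whose absolute value exceeds `tol` (exact zeros by default).
    """
    p0 = int(np.sum(np.abs(theta_hat) > tol))
    aic = -2.0 * n * loglik_bar + 2.0 * p0
    bic = -2.0 * n * loglik_bar + p0 * np.log(n)
    return aic, bic
```

The value of $\rho$ minimizing the chosen criterion along the solution path is then selected.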

Our algorithm sometimes produces a loading matrix in which some columns are zero vectors. In this case, the number of factors may be smaller than expected. The selection of the number of factors via regularization is achieved by taking advantage of this zero-column estimation (Caner and Han, 2014; Hirose and Yamamoto, 2015).

5. Monte Carlo Simulations

The performance of our proposed method is investigated through Monte Carlo simulations. The prenet penalization serves two different purposes of analysis: clustering of variables and exploratory factor analysis. In this section, we investigate the performance with respect to both purposes. The comparison of various exploratory factor analysis methods is described in Sect. 5.2, and the investigation of the clustering of variables is presented in Sect. 5.3.

5.1. Simulation Models

In this simulation study, we use the following three simulation models.

Model (A):

$$\boldsymbol{\Lambda} = \begin{pmatrix} 0.8 & 0 & 0 & 0 \\ 0 & 0.7 & 0 & 0 \\ 0 & 0 & 0.6 & 0 \\ 0 & 0 & 0 & 0.5 \end{pmatrix} \otimes \boldsymbol{1}_{25},$$

where $\boldsymbol{1}_{25}$ is a 25-dimensional vector with each element being 1.
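
Assuming numpy, Model (A)'s $100 \times 4$ loading matrix can be built directly from the Kronecker product above; each factor loads on its own block of 25 variables, yielding a perfect simple structure:

```python
import numpy as np

# Model (A): Kronecker product of diag(0.8, 0.7, 0.6, 0.5) with a column
# of 25 ones gives a 100 x 4 loading matrix with a perfect simple structure.
Lambda_A = np.kron(np.diag([0.8, 0.7, 0.6, 0.5]), np.ones((25, 1)))
```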

Model (B):

The size of the loading matrix of Model (B) is the same as that of Model (A), and the nonzero factor loadings share the same values. However, all zero elements in Model (A) are replaced by small random numbers from $U(-0.3, 0.3)$.

Model (C):

$$\boldsymbol{\Lambda} = (\boldsymbol{\Lambda}_1^T,\ \boldsymbol{\Lambda}_2^T)^T,$$

$$\boldsymbol{\Lambda}_1 = \begin{pmatrix}
0.79 & 0.00 & 0.00 & 0.49 & 0.50 & 0.00 & 0.68 & 0.29 & 0.66 & 0.33 & 0.00 & 0.00 & 0.62 \\
0.00 & 0.77 & 0.00 & 0.53 & 0.00 & 0.50 & 0.32 & 0.66 & 0.00 & 0.00 & 0.62 & 0.34 & -0.64 \\
0.00 & 0.00 & 0.76 & 0.00 & 0.52 & 0.48 & 0.00 & 0.00 & 0.34 & 0.66 & 0.33 & 0.62 & 0.00
\end{pmatrix}^T,$$

$$\boldsymbol{\Lambda}_2 = \begin{pmatrix}
-0.62 & 0.62 & -0.62 & 0.00 & 0.00 & 0.43 & 0.47 & 0.00 & 0.42 & 0.43 & 0.00 & 0.36 & 0.26 \\
0.64 & 0.00 & 0.00 & 0.67 & -0.67 & 0.58 & 0.00 & 0.50 & 0.57 & 0.00 & 0.50 & 0.38 & 0.43 \\
0.00 & -0.61 & 0.61 & -0.63 & 0.63 & 0.00 & 0.57 & 0.48 & 0.00 & 0.57 & 0.46 & 0.38 & 0.38
\end{pmatrix}^T.$$

The simulation is conducted for both orthogonal and oblique models on (A) and for an orthogonal model on (B)–(C). For Model (A), we write the orthogonal and oblique models as "Model (A-ORT)" and "Model (A-OBL)," respectively; "ORT" and "OBL" are abbreviations for "orthogonal" and "oblique." The factor correlations for the oblique model are set to 0.4. The unique variances are calculated by $\boldsymbol{\Psi} = \mathrm{diag}(\boldsymbol{I}_p - \boldsymbol{\Lambda}\boldsymbol{\Phi}\boldsymbol{\Lambda}^T)$.
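
A minimal sketch of the population covariance construction for Model (A-OBL), assuming numpy: `Phi` has unit diagonal and off-diagonal entries 0.4, and `Psi` follows the formula above, so that the model covariance has unit diagonal.

```python
import numpy as np

m = 4
# Model (A) loading matrix (perfect simple structure, 100 x 4)
Lambda = np.kron(np.diag([0.8, 0.7, 0.6, 0.5]), np.ones((25, 1)))
# Factor correlation matrix: 1 on the diagonal, 0.4 elsewhere
Phi = np.full((m, m), 0.4) + 0.6 * np.eye(m)
# Unique variances: Psi = diag(I_p - Lambda Phi Lambda^T)
Psi = np.diag(np.diag(np.eye(100) - Lambda @ Phi @ Lambda.T))
# Population covariance for generating N(0, Sigma) data
Sigma = Lambda @ Phi @ Lambda.T + Psi
```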

In Model (A), the loading matrix possesses the perfect simple structure. In such cases, the prenet is expected to perform well because it is able to estimate the perfect simple structure for large $\rho$ (Proposition 1). Note that $p = 100$ in Model (A); therefore, the maximum likelihood estimate is not available when $n < 100$.

The loading matrix of Model (B) is close to but not exactly a perfect simple structure. In this case, the prenet is expected to perform well when both $\rho$ and $\gamma$ are close to zero, thanks to Proposition 3. Meanwhile, the lasso would not perform well; a small illustrative example with an intuitive description is presented in Section S1 of the supplemental material.

In Model (C), the loading matrix is sparse but more complex than the perfect simple structure. The loading matrix is a rotated centroid solution of Thurstone's box problem, reported in Thurstone (1947). We use data(ThurstoneBox26) in the fungible package in R to obtain the loading matrix. With the original loading matrix, some of the unique variances can be larger than 1 under $\boldsymbol{\Psi} = \mathrm{diag}(\boldsymbol{I}_p - \boldsymbol{\Lambda}\boldsymbol{\Lambda}^T)$; therefore, the elements of the original loading matrix are multiplied by 0.83. Furthermore, to enhance the sparsity, factor loadings whose absolute values are less than 0.1 are replaced by 0.

5.2. Accuracy Investigation

The model parameter is estimated by the prenet penalty with $\gamma = 1.0$, 0.1, 0.01, the lasso, the MCP (Zhang, 2010)

$$\rho P(\boldsymbol{\Lambda}; \rho; \gamma) = \sum_{i=1}^p \sum_{j=1}^m \rho \int_0^{|\lambda_{ij}|} \left(1 - \frac{x}{\rho\gamma}\right)_+ dx$$

with $\gamma = 3$, and the elastic net with $\gamma = 0.1$. The regularization parameter $\rho$ is selected by the AIC and the BIC. We also compute a limit, $\lim_{\rho \rightarrow +0} \hat{\boldsymbol{\Lambda}}_\rho$, where $\hat{\boldsymbol{\Lambda}}_\rho$ is the estimate of the loading matrix obtained with regularization parameter $\rho$.
We note that $\lim_{\rho \rightarrow +0} \hat{\boldsymbol{\Lambda}}_\rho$ corresponds to the factor rotation with the MLE (Hirose and Yamamoto, 2015). In particular, the estimate with $\rho \rightarrow +0$ and $\gamma \rightarrow +0$ is equivalent to that obtained by the quartimin rotation with the MLE, thanks to Proposition 3. We also apply rotation techniques with the MLE: the varimax rotation (Kaiser, 1958) for the orthogonal model and the promax rotation (Hendrickson and White, 1964) for the oblique model. When the MLE cannot be found because $n < p$, we apply the lasso and use the solution with $\rho \rightarrow +0$ as an approximation of the MLE.
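
The MCP integral above has a simple closed form, which the following illustrative sketch evaluates (`mcp_penalty` is a hypothetical helper, not code from the paper):

```python
import numpy as np

def mcp_penalty(Lambda, rho, gamma):
    """rho * P(Lambda; rho; gamma) for the MCP, using the closed form of
    the integral: for t = |lambda_ij|,
      int_0^t (1 - x/(rho*gamma))_+ dx
        = t - t^2 / (2*rho*gamma)   if t <= rho*gamma,
        = rho*gamma / 2             otherwise (the integrand hits zero)."""
    t = np.abs(np.asarray(Lambda, dtype=float))
    inner = np.where(t <= rho * gamma,
                     t - t**2 / (2.0 * rho * gamma),
                     rho * gamma / 2.0)
    return rho * np.sum(inner)
```

Note that the penalty is capped at $\rho^2\gamma/2$ per loading, which is what makes the MCP nearly unbiased for large loadings.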

The warm start is used for Models (A) and (B). The dimension of these models is $p = 100$, and in our experience the warm start is more stable than the cold start for small $\rho$. Meanwhile, we adopt the cold start for Model (C) because Thurstone's box problem tends to possess multiple local minima.

In our implementation, the estimate of the elastic net sometimes diverges for the oblique model. Scharf and Nestler (2019a) reported the same phenomenon. To address this issue, we add a penalty $\zeta \log|\boldsymbol{\Phi}|$ to Eq. (2) with $\zeta = 0.01$. This penalty is based on Lee (1981), a conjugate prior for the Wishart distribution from a Bayesian viewpoint. We remark that, in our experience, the prenet does not tend to suffer from this divergence issue even with $\zeta = 0$, probably because the prenet does not shrink all loadings to zero, thanks to Proposition 1.
For example, assume that $\hat{\sigma}_{ij} = \hat{\lambda}_{im}\hat{\lambda}_{jk}\hat{\phi}_{km}$ ($k \ne m$). When the elastic net penalization is adopted, both $\hat{\lambda}_{im}$ and $\hat{\lambda}_{jk}$ are close to zero for large $\rho$.
When $s_{ij}$ is large, $\hat{\phi}_{km}$ must then be large to attain $\sigma_{ij} \approx s_{ij}$. Accordingly, the value of $\hat{\phi}_{km}$ can become significantly large; it can even exceed 1. Meanwhile, the prenet may not suffer from this problem because either $\hat{\lambda}_{im}$ or $\hat{\lambda}_{jk}$ can become large.

For each model, $T = 1000$ data sets are generated from $N(\boldsymbol{0}, \boldsymbol{\Lambda}\boldsymbol{\Phi}\boldsymbol{\Lambda}^T + \boldsymbol{\Psi})$. The number of observations is $n = 50$, 100, and 500. To investigate the performance of the various penalization procedures, we compare the root mean squared error (RMSE) over the $T = 1000$ simulations, which is defined by

$$\text{RMSE} = \frac{1}{T}\left(\sum_{s=1}^T \frac{\|\boldsymbol{\Lambda} - \hat{\boldsymbol{\Lambda}}^{(s)}\|_F^2}{pm}\right)^{1/2},$$

where $\hat{\boldsymbol{\Lambda}}^{(s)}$ is the estimate of the loading matrix for the $s$th dataset. We also compare the rate of nonzero factor loadings for Models (A) and (B). Because the loading matrix is identifiable only up to column permutation and sign, we choose the permutation and signs that minimize the RMSE. It is not fair to compare the rate of nonzero loadings for $\rho \rightarrow +0$, because the estimated loading matrix cannot become sparse. Thus, we apply hard-thresholding with a cutoff value of 0.1, the default of the loadings class in R.
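
The permutation-and-sign alignment can be sketched as follows; `aligned_rmse_term` is an illustrative helper returning the aligned squared error term $\|\boldsymbol{\Lambda} - \hat{\boldsymbol{\Lambda}}^{(s)}\|_F^2/(pm)$ inside the RMSE sum. An exhaustive search over column permutations is feasible here because $m$ is small (3 or 4).

```python
import itertools
import numpy as np

def aligned_rmse_term(Lambda, Lambda_hat):
    """Squared Frobenius error ||Lambda - Lambda_hat||_F^2 / (p*m),
    minimized over column permutations and sign flips of Lambda_hat."""
    p, m = Lambda.shape
    best = np.inf
    for perm in itertools.permutations(range(m)):
        L = Lambda_hat[:, perm]
        # Optimal sign per column: flip a column when its inner product
        # with the corresponding true column is negative.
        signs = np.where(np.sum(Lambda * L, axis=0) < 0, -1.0, 1.0)
        best = min(best, np.sum((Lambda - L * signs) ** 2) / (p * m))
    return best
```

For the nonzero-loading rates at $\rho \rightarrow +0$, the hard-thresholding step mentioned above amounts to zeroing entries of the aligned estimate with absolute value below 0.1 before counting.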

Figure 2. RMSE of factor loadings. The upper and lower bars represent the 95th and 5th percentiles, respectively. Here, "$\rho \rightarrow +0$" denotes the limit of the estimate of the factor loadings, $\lim_{\rho \rightarrow +0} \hat{\boldsymbol{\Lambda}}_\rho$, which corresponds to the factor rotation.

Figure 3. Rate of nonzero factor loadings. The upper and lower bars represent the 95th and 5th percentiles, respectively. Here, "$\rho \rightarrow +0$" denotes the limit of the estimate of the factor loadings, $\lim_{\rho \rightarrow +0} \hat{\boldsymbol{\Lambda}}_\rho$, which corresponds to the factor rotation.

The results for the RMSE and the rate of nonzero factor loadings are depicted in Figs. 2 and 3, respectively. For reference, the true positive rate (TPR) and false positive rate (FPR) of the loading matrix are depicted in Figures S1.4 and S1.5 in the supplemental material. The range of the error bar indicates a 90% confidence interval; we calculate the 5% and 95% quantiles over the 1000 simulation results and use them as the boundaries of the error bar. We obtain the following empirical observations from these figures.

Model (A-ORT):

The prenet penalization outperforms the existing methods in terms of RMSE when the regularization parameter is selected by the model selection criteria. It is also seen that the performance of the prenet is almost independent of $\gamma$. When $\rho \rightarrow +0$, all of the estimation procedures yield similar performance. When $\rho$ is selected by the model selection criteria, the rate of nonzero loadings of the prenet is almost 0.25, which is the true rate. Considering the TPR result in Figure S1.4 in the supplemental material, the prenet with the AIC or the BIC correctly estimates the true zero/nonzero pattern. Meanwhile, the MCP, elastic net, and lasso tend to select a denser model than the true one.

Model (A-OBL):

The result for the oblique model is similar to that for the orthogonal model, but the oblique model tends to produce larger RMSEs than the orthogonal model in most cases. The limit $\rho \rightarrow +0$ produces a larger RMSE than that obtained with $\rho$ selected by the model selection criteria, probably because the loss function becomes flat as $\rho \rightarrow +0$. Therefore, the regularization may help improve the accuracy. We note that the promax rotation, which corresponds to $\rho \rightarrow +0$, turns out to be stable.

Model (B):

When $n$ is large, the prenet with small $\gamma$ and the varimax rotation produce small RMSEs. Because the true values of the cross-loadings (small loadings) are close to but not exactly zero, the $L_1$-type regularization that induces a sparse loading matrix does not work well. The prenet with $\gamma = 0.01$ performs sparse estimation yet produces a loading matrix similar to that of the quartimin rotation, that is, a nearly nonsparse loading matrix. We also observe that the prenet with the BIC sometimes results in a loading matrix that is too sparse when $n = 50$.

Model (C):

For small n, all methods result in large RMSEs. For large n, the L_1 regularization methods, including the lasso, MCP, elastic net, and prenet with large γ, yield small RMSEs. However, the prenet with small γ and the varimax rotation, which tend to estimate a nonsparse loading matrix, produce large RMSEs. Indeed, the average values of the loading matrix in the supplemental material show that the prenet with small γ is biased. Furthermore, the varimax rotation of the true loading matrix does not approximate the true one. Therefore, when the loading matrix is sparse but does not have the perfect simple structure, the lasso-type penalization or the prenet with γ = 1 would perform well.

5.3. Investigation of Clustering via Perfect Simple Structure Estimation

As shown in Proposition 1, our proposed method allows the clustering of variables via perfect simple structure estimation. We investigate the clustering accuracy on Model (A); the true loading matrix has the perfect simple structure, so the true clusters are known. Figure 4 shows the Adjusted Rand Index (ARI) between the true clusters and those obtained by the prenet, lasso, and varimax. The range of the error bar indicates the 90% confidence interval; we calculate the 5% and 95% quantiles over 1000 simulation results and use them as the boundaries of the error bar.

The clustering via the prenet is achieved by perfect simple structure estimation. Because the lasso and varimax cannot always estimate the perfect simple structure, we estimate their clusters as follows: for the ith row vector of Λ̂, say λ̂_i = (λ̂_i1, …, λ̂_im)^T, the ith variable belongs to the jth cluster, where j = argmax_{j ∈ {1,…,m}} |λ̂_ij|. The regularization parameter for the lasso is ρ → +0, which corresponds to a special case of the component loss criterion (Jennrich, 2004, 2006) with the MLE.
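The assignment rule above is easy to state in code. The following is an illustrative sketch, not the authors' implementation; the loading matrix is a made-up two-factor example.

```python
import numpy as np

def assign_clusters(Lambda_hat):
    """Assign the ith variable to the jth cluster, where
    j = argmax_j |lambda_ij| over the ith row of the loading matrix."""
    return np.argmax(np.abs(Lambda_hat), axis=1)

# Hypothetical 6-variable, 2-factor loading matrix with small cross-loadings.
Lambda_hat = np.array([
    [ 0.8,  0.1],
    [ 0.7, -0.2],
    [ 0.6,  0.0],
    [ 0.1,  0.9],
    [-0.2,  0.8],
    [ 0.0,  0.7],
])
clusters = assign_clusters(Lambda_hat)
print(clusters)  # [0 0 0 1 1 1]
```

Under a perfect simple structure, each row has a single nonzero entry, so the rule reduces to reading off the position of that entry, which is exactly how the clustering arises from the prenet with a large penalty.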

The result of Fig. 4 shows that the prenet and varimax result in almost identical ARIs and are slightly better than the lasso when n = 50 on Model (A-ORT). All methods correctly detect the true clusters when n = 100 and n = 500. For Model (A-OBL), the prenet performs slightly better than the varimax when n = 50. As with the orthogonal model, the prenet and varimax correctly detect the true clusters when n = 100 and n = 500. The lasso performs worse than the other two methods for small sample sizes, suggesting that the prenet or varimax would be preferable when the clustering of variables is the purpose of the analysis.

Figure 4. Adjusted Rand Index (ARI) of the clustering results.

6. Analysis of Big Five Personality Traits

We apply the prenet penalization to survey data on the big five personality traits collected from the Open Source Psychometrics Project (https://openpsychometrics.org/). Other real data applications (electricity demand and fMRI data) are described in Section S2 of the supplemental material. n = 8582 respondents in the US region were asked to assess their own personality based on 50 questions developed by Goldberg (1992). Each question asks how well a statement describes the respondent on a scale of 1–5. It is well known that personality is characterized by five common factors; therefore, we choose m = 5. Several earlier researchers showed that the loading matrix may not possess the perfect simple structure due to small cross-loadings (Marsh et al., 2010, 2013; Booth and Hughes, 2014); therefore, we do not aim at estimating the perfect simple structure with ρ → ∞ in this analysis. We first interpret the estimated model and then investigate the performance of the prenet penalization with ρ selected by model selection criteria for various sample sizes. The impact of the regularization parameter on the accuracy is also studied.

6.1. Interpretation of Latent Factors

We first apply the prenet penalization and the varimax rotation with the maximum likelihood estimate and compare the loading matrices estimated by these two methods. For the prenet penalization, we choose the regularization parameter using the AIC, BIC, and tenfold cross-validation (CV) with γ = 1. The regularization parameter selected by the AIC and CV is ρ = 7.4 × 10^-4, and that selected by the BIC is ρ = 2.9 × 10^-3. The heatmaps of the loading matrices are shown in Fig. 5; they are almost identical, and all methods detect the five personality traits appropriately. We also observe that the result is almost independent of γ. These similar results may be due to the large sample size.
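Selecting ρ by an information criterion amounts to a grid search: fit the penalized model at each candidate ρ and keep the minimizer of the criterion. The sketch below shows only the search skeleton; `fit` and `criterion` are hypothetical placeholders (here a made-up score minimized near ρ = 0.3), not the paper's estimator or its AIC/BIC definitions.

```python
import numpy as np

def select_rho(rhos, fit, criterion):
    """Grid search: fit at each rho and return the rho minimizing the
    model selection criterion (e.g., AIC or BIC)."""
    scores = [(criterion(fit(rho)), rho) for rho in rhos]
    return min(scores)[1]

# Stand-in fit/criterion: a made-up score minimized at rho = 0.3.
rho_grid = np.linspace(0.1, 1.0, 10)
best = select_rho(rho_grid, fit=lambda r: r, criterion=lambda m: (m - 0.3) ** 2)
print(round(best, 1))  # 0.3
```

Cross-validation follows the same skeleton, with `criterion` replaced by the held-out loss averaged over folds.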

Figure 5. Heatmaps of the loading matrices on big five personality traits data. Each cell corresponds to the factor loading, and the depth of color indicates the magnitude of the value of the factor loading.

We explore the estimated cross-loadings whose absolute values are larger than 0.3; it is reasonable to regard these cross-loadings as affecting the items. There exist four items with such large absolute cross-loadings, and the factor loadings related to these four items are shown in Table 1.

Table 1. Factor loadings of four items estimated by the prenet penalization with γ = 0.01. The regularization parameter ρ is selected by the BIC. The cross-loadings whose absolute values are larger than 0.3 are written in bold.

The loading matrix is estimated by the prenet penalization with γ = 0.01, where the regularization parameter ρ is selected by the BIC. The five factors represent "F1: extraversion," "F2: neuroticism," "F3: agreeableness," "F4: conscientiousness," and "F5: openness to experience." For reference, the complete loading matrix is shown in Tables S2.5 and S2.6 of the supplemental material.

The three items A2, A7, and A10 are affected by both "F1: extraversion" and "F3: agreeableness." The main and cross-loadings on the same item have opposite signs, which admits a natural interpretation of the factor loadings. For example, as for the question "A7: Have a soft heart," it is easy to imagine that some extraverted people are not kind: they are more interested in what they can gain from a person than in the situation that person is in, and thus can act selfishly to obtain that gain even if the person's feelings are hurt. Booth and Hughes (2014) also reported similar cross-loadings and mentioned that they were due to the overlap in content between extraversion and agreeableness.

Furthermore, we perform an M = 100 subsampling simulation with n = 500 to investigate whether these cross-loadings can be found with small sample sizes. We compare the performance of four estimation methods: the prenet, lasso, MCP, and varimax rotation. For the regularization methods, ρ is selected by the BIC. We set γ = 1 and γ = 0.01 for the prenet and γ = 3 for the MCP.

Table 2. The number of times that the absolute values of the four cross-loadings exceed 0.3. For the regularization methods, ρ is selected by the BIC.

Table 2 shows the number of times that the absolute values of these four cross-loadings exceed 0.3. The results show that the prenet with γ = 0.01 identifies these four cross-loadings most frequently among the M = 100 simulations.

6.2. RMSE Comparison

We investigate the performance of the prenet in terms of the estimation accuracy of the loading matrix through a subsampling simulation. First, the dataset is randomly split into two datasets, X_1 and X_2, without replacement. The sample sizes of X_1 and X_2 are n/2 = 4291. X_1 is used for estimating a loading matrix with a large sample size; we perform the varimax rotation with the MLE and regard the estimated loading matrix as the true loading matrix, say Λ_true. The true loading matrix is almost identical to the loading matrix obtained by the varimax with the entire dataset. We remark that the true loading matrix is also similar to Model (B) of the Monte Carlo simulation described in Sect. 5.1.

The performance is investigated by subsampling observations from X_2 with n = 100 and n = 500. Figure 6 depicts the RMSE and the rate of nonzero loadings for the n subsampled observations over 100 simulations. The RMSE is defined as

RMSE = (1/100) ( Σ_{s=1}^{100} ‖Λ_true − Λ̂^(s)‖_F^2 / (pm) )^{1/2},

where Λ̂^(s) is the estimate of the loading matrix using the sth subsampled dataset. We apply the lasso, the MCP with γ = 3, the prenet with γ = 1, 0.1, 0.01, and the varimax rotation with the MLE. The regularization parameter ρ is selected by the AIC, BIC, and tenfold CV. We also compute the loading matrix as ρ → +0, which results in the solution of the factor rotation with the MLE. The nonzero pattern of the loading matrix for ρ → +0 is estimated by hard-thresholding with a cutoff value of 0.1. The range of the error bar indicates the 90% confidence interval over 100 simulations.
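The RMSE above and the hard-thresholded nonzero rate can be sketched as follows. This is an illustrative implementation that assumes the estimated loadings have already been aligned (sign- and column-permutation-matched) to the true matrix; the toy matrices are made up.

```python
import numpy as np

def rmse(Lambda_true, estimates):
    """RMSE over S replications, following the definition in the text:
    (1/S) * ( sum_s ||Lambda_true - Lambda_hat^(s)||_F^2 / (p*m) )**0.5."""
    S = len(estimates)
    p, m = Lambda_true.shape
    total = sum(np.sum((Lambda_true - L) ** 2) for L in estimates)
    return (total / (p * m)) ** 0.5 / S

def nonzero_rate(Lambda_hat, cutoff=0.0):
    """Rate of nonzero loadings; for rho -> +0 the nonzero pattern is
    obtained by hard-thresholding at |loading| > cutoff (0.1 in the text)."""
    return float(np.mean(np.abs(Lambda_hat) > cutoff))

# Toy check: one perfect estimate and one all-zero estimate of a 2x2 identity.
Lambda_true = np.eye(2)
estimates = [np.eye(2), np.zeros((2, 2))]
print(rmse(Lambda_true, estimates))                # (2/4)**0.5 / 2, about 0.354
print(nonzero_rate(np.zeros((2, 2)), cutoff=0.1))  # 0.0
```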

Figure 6. RMSE and rate of nonzero loadings when n = 100 and 500. Here, "ρ → +0" denotes the limit of the estimate of the factor loadings, lim_{ρ → +0} Λ̂_ρ, which corresponds to the factor rotation.

We have the following empirical observations from Fig. 6:

  • The smaller the number of observations, the sparser the solution. An earlier study has shown that model selection criteria can select a parsimonious model with small sample sizes in general frameworks (Cudeck and Henly, 1991).

  • The BIC results in a larger RMSE and a lower rate of nonzero loadings than the other criteria, especially for small sample sizes. Thus, the BIC tends to select sparse solutions, and some of the small nonzero factor loadings are estimated to be zero.

  • When the lasso or MCP is applied, the CV results in poor RMSE when n = 100. This is because the estimated loading matrix is too sparse; it becomes an (almost) zero matrix. When the prenet is applied, such a loading matrix cannot be obtained, thanks to Proposition 1.

  • With the prenet, small γ tends to estimate a dense loading matrix and produce good RMSE. A similar tendency is found in Model (B) of the Monte Carlo simulation, described in Sect. 5.2.

6.3. Impact of Tuning Parameters

We investigate the impact of the tuning parameters (ρ, γ) on the estimation of the loading matrix. Figure 7 depicts the heatmaps of the loading matrices for various values of the tuning parameters on the MCP and the prenet penalization. We choose the tuning parameters so that the degrees of sparseness (proportion of nonzero values) of the loading matrix are approximately 20%, 25%, 40%, and 50%. For the MCP, we set γ = ∞ (i.e., the lasso), 5.0, 2.0, and 1.01. For the prenet penalty, we set γ = 1.0, 0.5, and 0.01. Each cell describes the elements of the factor loadings as in Fig. 5.
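Matching a target degree of sparseness can be automated with a bisection on ρ, assuming sparsity increases monotonically with ρ. In this sketch, `fit_loadings` is a hypothetical stand-in (soft-thresholding a fixed reference matrix) for the actual penalized estimator, and the reference loadings are randomly generated.

```python
import numpy as np

def fit_loadings(Lambda0, rho):
    """Stand-in for a penalized fit: soft-threshold a reference loading
    matrix, so a larger rho gives a sparser result."""
    return np.sign(Lambda0) * np.maximum(np.abs(Lambda0) - rho, 0.0)

def nonzero_rate(L):
    return float(np.mean(L != 0.0))

def rho_for_sparseness(Lambda0, target, lo=0.0, hi=1.0, iters=50):
    """Bisect on rho so that the proportion of nonzero loadings is
    approximately the target, e.g., 0.20, 0.25, 0.40, or 0.50."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if nonzero_rate(fit_loadings(Lambda0, mid)) > target:
            lo = mid  # still too dense: increase rho
        else:
            hi = mid
    return hi

rng = np.random.default_rng(0)
Lambda0 = rng.uniform(-1.0, 1.0, size=(50, 5))  # made-up reference loadings
rho = rho_for_sparseness(Lambda0, target=0.25)
print(round(nonzero_rate(fit_loadings(Lambda0, rho)), 2))  # 0.25
```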

Figure 7. Heatmaps of the loading matrices on big five personality traits data for various values of tuning parameters on the MCP and the prenet penalization.

From Fig. 7, we obtain the empirical observations as follows.

  • With the prenet penalization, the characteristics of the five personality traits are appropriately extracted for any values of the tuning parameters, which suggests that the prenet penalization is relatively robust against the choice of tuning parameters.

  • The prenet penalization is able to estimate the perfect simple structure when the degree of sparseness is 20%. On the other hand, with the MCP, we are not able to estimate the perfect simple structure even when γ is sufficiently small.

  • With the lasso, the number of factors becomes less than five when the degrees of sparsity are 20% and 25%; the five personality traits cannot be found. When the value of γ is not sufficiently large, the MCP produces a five-factor model.

7. Discussion

We proposed the prenet penalty, which is based on the product of a pair of parameters in each row of the loading matrix. The prenet aims at not only sparsity but also simplicity. Indeed, the prenet is a generalization of the quartimin criterion, one of the most popular oblique rotation techniques for simple structure estimation. Furthermore, the prenet is able to estimate the perfect simple structure, which provides a new method for clustering variables using factor models. The clustering of variables opens the door to the application of factor analysis to a wide variety of sciences, such as image analysis, neuroscience, marketing, and the biosciences.

The prenet penalization serves two different purposes of analysis: clustering of variables and exploratory factor analysis. How the prenet penalization is used depends on the purpose. When the purpose of the analysis is the clustering of variables, the regularization parameter is set to ρ → ∞ to achieve perfect simple structure estimation. The prenet performs better than the lasso and varimax in terms of clustering accuracy, as described in Sect. 5.3. Furthermore, the real data analyses in Section S2 of the supplemental material show the superiority of the prenet over conventional clustering methods, such as k-means clustering. When the purpose of the analysis is exploratory factor analysis, the perfect simple structure estimation is not necessarily needed; in this case, the regularization parameter is selected by model selection criteria. The numerical results show that the prenet penalization performs well when an appropriate value of γ is selected.

Over a span of several decades, a number of researchers have developed methods for finding a simple loading matrix (e.g., Kaiser, 1958; Hendrickson and White, 1964) in the Thurstonian sense (Thurstone, 1947). As the simple structure is a special case of sparsity, the lasso-type sparse estimation may seem more flexible than the prenet. Indeed, the recent trend in the exploratory factor analysis literature is to find a loading matrix that possesses sparsity rather than simplicity (Jennrich, 2004, 2006; Trendafilov, 2013; Scharf and Nestler, 2019a, 2019b).

Nevertheless, the lasso-type sparse estimation is not as flexible as expected. As seen in Model (B) of the Monte Carlo simulation and Section S1 of the supplemental material, the lasso often cannot approximate the true loading matrix when the cross-loadings are not exactly zero but close to zero. Because some factor loadings are estimated to be exactly zero with the lasso, some other factor loadings turn out to be excessively large, which causes difficulty in interpretation.

For this reason, we believe both sparsity and simplicity play important roles in the interpretation. The sparse estimation automatically produces the nonzero pattern of the loading matrix, which allows us to interpret the latent factors easily. In addition, simplicity is also helpful for the interpretation, as shown in Thurstone (1947). The prenet penalization is able to achieve simplicity and sparsity simultaneously. Indeed, a sparse loading matrix is estimated thanks to the absolute-value term Σ_{i,j,k} |λ_ij λ_ik| in the penalty. In addition, simplicity is also achieved because the prenet generalizes the quartimin criterion, which often produces a simple structure (Jennrich and Sampson, 1966). Furthermore, with a large value of the regularization parameter, the loading matrix enjoys the perfect simple structure. Meanwhile, the existing methods cannot always produce a loading matrix that is both sparse and simple; for example, the lasso produces a loading matrix that is sparse but not always simple.
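To make the penalty concrete, the following sketch evaluates a product-based row-wise penalty of the kind described here: an elastic-net-style mix of |λ_ij λ_ik| and (λ_ij λ_ik)^2 over pairs j < k within each row. The mixing constants are our illustrative assumption and may differ from the paper's exact definition.

```python
import numpy as np
from itertools import combinations

def prenet_penalty(Lambda, rho, gamma):
    """Illustrative product-based penalty: for each row, sum over pairs
    (j, k), j < k, of gamma*|l_j*l_k| + (1 - gamma)/2*(l_j*l_k)**2.
    The exact constants in the paper's definition may differ."""
    total = 0.0
    for row in Lambda:
        for lj, lk in combinations(row, 2):
            prod = lj * lk
            total += gamma * abs(prod) + 0.5 * (1.0 - gamma) * prod ** 2
    return rho * total

# A perfect simple structure (one nonzero per row) incurs zero penalty.
pss = np.array([[0.9, 0.0], [0.0, 0.8]])
print(prenet_penalty(pss, rho=1.0, gamma=0.5))    # 0.0
# A matrix with cross-loadings is penalized: |0.9*0.4| + |0.3*0.8|, about 0.6.
dense = np.array([[0.9, 0.4], [0.3, 0.8]])
print(prenet_penalty(dense, rho=1.0, gamma=1.0))
```

The example illustrates why a large penalty drives the estimate toward the perfect simple structure: zeroing one element of each product suffices, so no row is forced to vanish entirely.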

Structural equation modeling (SEM) has been widely used in the social and behavioral sciences. SEM covers a wide variety of statistical models, including the factor analysis model and the regression model. An analyst develops an assumption about a causal relationship and determines whether the assumption is correct by testing hypotheses or evaluating goodness-of-fit indices. Recently, several researchers have proposed regularized structural equation models (Jacobucci et al., 2016; Huang et al., 2017; Huang, 2018). The analyst sets lasso-type penalties on specific model parameters to conduct automatic selection of the causal relationships, enhancing flexibility in model specification. The application of the prenet to SEM would be an interesting future research topic.

The lasso-type regularization extracts only the nonzero pattern of parameters. In some cases, the analyst needs to detect not only the nonzero pattern but also a more complex parameter structure, and the penalty must be chosen according to that structure. For example, when the analyst needs to estimate either $\theta_1$ or $\theta_2$ to be zero, the prenet penalty would be more useful than the lasso. More generally, when one of $\theta_1,\dots,\theta_k$ is exactly or close to zero, we may use the Geomin-type penalty $\prod_{j=1}^k|\theta_j|$. Applying penalties that lead to such structured sparsity would further enhance the flexibility of the analysis, but this is beyond the scope of this research; we leave it as a future research topic.
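To make the contrast concrete, a toy sketch of such a product-based penalty (the helper names below are ours, chosen for illustration) shows that it vanishes as soon as any single coordinate reaches zero, whereas a lasso-type sum keeps penalizing the remaining nonzero coordinates:

```python
def geomin_type_penalty(theta):
    """Product of absolute values, prod_j |theta_j|: zero whenever ANY
    coordinate is exactly zero, regardless of the other coordinates."""
    prod = 1.0
    for t in theta:
        prod *= abs(t)
    return prod

def lasso_penalty(theta):
    """Sum of absolute values, sum_j |theta_j|, for comparison."""
    return sum(abs(t) for t in theta)

theta = [0.5, -2.0, 0.0]
print(geomin_type_penalty(theta))  # 0.0: one zero coordinate suffices
print(lasso_penalty(theta))        # 2.5: still penalizes the nonzero ones
```

This is the sense in which the product penalty targets a structure ("at least one of these parameters is zero") rather than plain elementwise sparsity.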

Another interesting extension of the prenet penalization is the clustering of not only variables but also observations. This method is referred to as biclustering (e.g., Tan and Witten, 2014; Flynn and Perry, 2020). To achieve this, we may need an entirely new formulation along with an algorithm to compute the optimal solution. This extension should also be a future research topic.

Supplemental Materials

Further numerical and theoretical investigations: Analyses of resting-state fMRI data, electricity demand data, a loading matrix of the big five data, and a comparison with the Geomin criterion.

R-package fanc: An R package containing code that performs our proposed algorithm.

Loadings: Averages of the estimated loading matrices for the Monte Carlo simulations in Section 5, provided as Excel files.

Acknowledgements

The authors would like to thank an associate editor and three reviewers for valuable comments and suggestions that considerably improved the quality of the paper. This research was supported in part by JSPS KAKENHI Grants (JP19K11862 and JP22H01139 to KH; JP20K19756 and JP20H00601 to YT) and the MEXT Project for Seismology toward Research Innovation with Data of Earthquake (STAR-E), Grant Number JPJ010217.

A. Proofs

A.1. Proof of Proposition 2

By Proposition 1, with the prenet, $\hat{\lambda}_{ij}\hat{\lambda}_{ik} = 0$ as $\rho \rightarrow \infty$. Thus, the prenet solution satisfies (9) as $\rho \rightarrow \infty$. We only need to show that the minimization of the loss function $\ell_{\mathrm{ML}}(\varvec{\Lambda},\varvec{\Psi})$ is equivalent to that of $\Vert \varvec{S} - \varvec{\Lambda}\varvec{\Lambda}^T\Vert_{F}^2$. The inverse covariance matrix of the observed variables is expressed as

$$\varvec{\Sigma}^{-1} = \varvec{\Psi}^{-1} - \varvec{\Psi}^{-1}\varvec{\Lambda}(\varvec{\Lambda}^T\varvec{\Psi}^{-1}\varvec{\Lambda} + \varvec{I}_m)^{-1}\varvec{\Lambda}^T\varvec{\Psi}^{-1}.$$

Because $\varvec{\Lambda}^T\varvec{\Lambda} = \varvec{I}_m$, we obtain

$$\varvec{\Sigma}^{-1} = \alpha^{-1}\varvec{I}_p - \frac{\alpha^{-2}}{\alpha^{-1} + 1}\varvec{\Lambda}\varvec{\Lambda}^T.$$

The determinant of $\varvec{\Sigma}$ can be calculated as

$$|\varvec{\Sigma}| = \alpha^{p-m}(1+\alpha)^m.$$
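These two identities are easy to sanity-check numerically. The sketch below (a verification aid, not part of the proof) assumes, consistently with the derivation, that $\varvec{\Lambda}^T\varvec{\Lambda} = \varvec{I}_m$ and that the covariance has the form $\varvec{\Sigma} = \varvec{\Lambda}\varvec{\Lambda}^T + \alpha\varvec{I}_p$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, alpha = 6, 2, 0.5

# Loading matrix with orthonormal columns (Lambda^T Lambda = I_m) via QR.
Lam, _ = np.linalg.qr(rng.standard_normal((p, m)))

# Covariance implied by the derivation: Sigma = Lambda Lambda^T + alpha I_p.
Sigma = Lam @ Lam.T + alpha * np.eye(p)

# Inverse formula from the proof.
Sigma_inv = alpha**-1 * np.eye(p) - alpha**-2 / (alpha**-1 + 1) * (Lam @ Lam.T)
print(np.allclose(Sigma @ Sigma_inv, np.eye(p)))  # True

# Determinant identity |Sigma| = alpha^(p - m) * (1 + alpha)^m.
print(np.isclose(np.linalg.det(Sigma), alpha**(p - m) * (1 + alpha)**m))  # True
```

Both checks pass because $\varvec{\Sigma}$ has eigenvalue $1+\alpha$ with multiplicity $m$ (on the column space of $\varvec{\Lambda}$) and $\alpha$ with multiplicity $p-m$.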

Then, the discrepancy function in (3) is expressed as

$$\frac{1}{2}\left\{ \mathrm{tr}(\alpha^{-1}\varvec{S}) - \frac{\alpha^{-2}}{\alpha^{-1} + 1}\mathrm{tr}\left( \varvec{\Lambda}^T\varvec{S}\varvec{\Lambda} \right) + p\log \alpha + m \log \left( 1+\frac{1}{\alpha} \right) - \log |\varvec{S}| - p \right\}.$$

Because $\alpha$ is given and $\Vert \varvec{S} - \varvec{\Lambda}\varvec{\Lambda}^T\Vert_{F}^2 = -2\,\mathrm{tr}\left( \varvec{\Lambda}^T\varvec{S}\varvec{\Lambda} \right) + C$ for a constant $C$, we can derive (10).

A.2. Proof of Proposition 3

Recall that $\hat{\varvec{\theta}}$ is an unpenalized estimator that satisfies $\ell(\hat{\varvec{\theta}})=\min_{\varvec{\theta} \in \Theta} \ell(\varvec{\theta})$ and $\hat{\varvec{\theta}}_q$ is a quartimin solution obtained by the following problem:

$$\min_{\varvec{\theta} \in \Theta} P_{\mathrm{qmin}}(\varvec{\Lambda}), \quad \text{subject to } \ell(\varvec{\theta}) = \ell(\hat{\varvec{\theta}}).$$

First, we show that

$$\lim_{n\rightarrow \infty} d(\hat{\varvec{\theta}}_q,\Theta_q^*)=0 \quad a.s. \tag{A.1}$$

From the assumptions, in the same manner as Chapter 6 of Pfanzagl (1994), we can obtain the following strong consistency:

$$\lim_{n\rightarrow \infty} d(\hat{\varvec{\theta}},\Theta_*)=0 \;\text{ and }\; \lim_{n\rightarrow \infty} d(\hat{\varvec{\theta}}_{\rho_n},\Theta_*)=0 \quad a.s., \tag{A.2}$$

where $\Theta_*:=\{\varvec{\theta}\in \Theta \mid \ell_*(\varvec{\theta})=\min_{\varvec{\theta}\in \Theta}\ell_*(\varvec{\theta})\}$. When $\lim_{n\rightarrow \infty} d(\hat{\varvec{\theta}},\Theta_*)=0$, for all $\epsilon >0$, by taking $n$ large enough, we have

$$\Vert \hat{\varvec{\Lambda}}-\varvec{\Lambda}_*\Vert_{F}<\epsilon \quad a.s.$$

for some $(\mathrm{vec}(\varvec{\Lambda}_*)^T,\mathrm{diag}(\varvec{\Psi}_*)^T,\mathrm{vech}(\varvec{\Phi}_*)^T)^T\in \Theta_*$. From the uniform continuity of $P_{\mathrm{qmin}}$ on $\Theta$ and the fact that $\Vert \hat{\varvec{\Lambda}}\varvec{T}-\varvec{\Lambda}_*\varvec{T}\Vert_{F}=\Vert \hat{\varvec{\Lambda}}-\varvec{\Lambda}_*\Vert_{F}$ for any $\varvec{T}\in {\mathcal{O}}(m)$, we have

$$\sup_{\varvec{T}\in {\mathcal{O}}(m)} |P_{\mathrm{qmin}}(\hat{\varvec{\Lambda}}\varvec{T})-P_{\mathrm{qmin}}(\varvec{\Lambda}_*\varvec{T})|<\epsilon \quad a.s. \tag{A.3}$$

Write $\hat{\varvec{T}}:=\mathop{\arg \min}_{\varvec{T}\in {\mathcal{O}}(m)}P_{\mathrm{qmin}}(\hat{\varvec{\Lambda}}\varvec{T})$ and $\varvec{T}_*:=\mathop{\arg \min}_{\varvec{T}\in {\mathcal{O}}(m)}P_{\mathrm{qmin}}(\varvec{\Lambda}_*\varvec{T})$. We have

$$P_{\mathrm{qmin}}(\hat{\varvec{\Lambda}}\hat{\varvec{T}})-P_{\mathrm{qmin}}(\varvec{\Lambda}_*\hat{\varvec{T}}) \le P_{\mathrm{qmin}}(\hat{\varvec{\Lambda}}\hat{\varvec{T}})-P_{\mathrm{qmin}}(\varvec{\Lambda}_*\varvec{T}_*) \le P_{\mathrm{qmin}}(\hat{\varvec{\Lambda}}\varvec{T}_*)-P_{\mathrm{qmin}}(\varvec{\Lambda}_*\varvec{T}_*).$$

From this, it follows that

$$|P_{\mathrm{qmin}}(\hat{\varvec{\Lambda}}\hat{\varvec{T}})-P_{\mathrm{qmin}}(\varvec{\Lambda}_*\varvec{T}_*)| \le \sup_{\varvec{T}\in {\mathcal{O}}(m)} |P_{\mathrm{qmin}}(\hat{\varvec{\Lambda}}\varvec{T})-P_{\mathrm{qmin}}(\varvec{\Lambda}_*\varvec{T})|.$$

Thus, using (A.3), we obtain (A.1).

Next, in a manner similar to Proposition 15.1 of Foucart and Rauhut (2013), we prove $\lim_{n\rightarrow \infty} d(\hat{\varvec{\theta}}_{\rho_n},\Theta_q^*)=0$ a.s. By the definition of $\hat{\varvec{\theta}}_{\rho_n}$, for any $\rho_n>0$ we have

$$\ell(\hat{\varvec{\theta}}_{\rho_n}) + \rho_n P_{\mathrm{qmin}}(\hat{\varvec{\Lambda}}_{\rho_n}) \le \ell(\hat{\varvec{\theta}}_q) + \rho_n P_{\mathrm{qmin}}(\hat{\varvec{\Lambda}}_q) \tag{A.4}$$

and

$$\ell(\hat{\varvec{\theta}}_{\rho_n}) \ge \ell(\hat{\varvec{\theta}}_q). \tag{A.5}$$

Combining (A.1), (A.4), and (A.5), we obtain

$$P_{\mathrm{qmin}}(\hat{\varvec{\Lambda}}_{\rho_n})\le P_{\mathrm{qmin}}(\hat{\varvec{\Lambda}}_q)\rightarrow P_{\mathrm{qmin}}(\varvec{\Lambda}^{*}_q)\quad a.s.$$

for some $(\mathrm{vec}(\varvec{\Lambda}^*_q)^T,\mathrm{diag}(\varvec{\Psi}^*_q)^T,\mathrm{vech}(\varvec{\Phi}^*_q)^T)^T\in \Theta^*_q$. Therefore, we have

$$\lim_{n \rightarrow \infty} P_{\mathrm{qmin}}(\hat{\varvec{\Lambda}}_{\rho_n})\le P_{\mathrm{qmin}}(\varvec{\Lambda}^{*}_q)\quad a.s.$$

As shown in (A.2), $\lim_{n\rightarrow \infty} d(\hat{\varvec{\theta}}_{\rho_n},\Theta_*)=0$ a.s., and $\varvec{\Lambda}^{*}_q$ is a minimizer of $P_{\mathrm{qmin}}(\cdot)$ over $\Theta_*$, so the proof is complete.

B. Detail of the Algorithm

B.1. Update Equation of EM Algorithm for Fixed Tuning Parameters

We provide update equations for the factor loadings and unique variances when $\rho$ and $\gamma$ are fixed. Suppose that $\varvec{\Lambda}_{\mathrm{old}}$, $\varvec{\Psi}_{\mathrm{old}}$, and $\varvec{\Phi}_{\mathrm{old}}$ are the current values of the parameters. The parameters can be updated by minimizing the negative expectation of the complete-data penalized log-likelihood function with respect to $\varvec{\Lambda}$, $\varvec{\Psi}$, and $\varvec{\Phi}$ (e.g., Hirose and Yamamoto, 2014):

$$Q(\varvec{\Lambda},\varvec{\Psi},\varvec{\Phi}) = \sum_{i=1}^p \left( \log \psi_i + \frac{s_{ii} - 2\varvec{\lambda}_i^T\varvec{b}_i + \varvec{\lambda}_i^T \varvec{A}\varvec{\lambda}_i}{\psi_i} \right) + \log |\varvec{\Phi}| + \mathrm{tr}( \varvec{\Phi}^{-1} \varvec{A}) + \rho P(\varvec{\Lambda}) + C, \tag{A.6}$$

where C is a constant and

$$\varvec{A} = \varvec{M}^{-1} + \varvec{M}^{-1}\varvec{\Lambda}_{\mathrm{old}}^T\varvec{\Psi}_{\mathrm{old}}^{-1}\varvec{S}\varvec{\Psi}_{\mathrm{old}}^{-1}\varvec{\Lambda}_{\mathrm{old}}\varvec{M}^{-1}, \tag{A.7}$$
$$\varvec{b}_i = \varvec{M}^{-1}\varvec{\Lambda}_{\mathrm{old}}^T\varvec{\Psi}_{\mathrm{old}}^{-1}\varvec{s}_i, \tag{A.8}$$
$$\varvec{M} = \varvec{\Lambda}_{\mathrm{old}}^T\varvec{\Psi}_{\mathrm{old}}^{-1}\varvec{\Lambda}_{\mathrm{old}} + \varvec{\Phi}_{\mathrm{old}}^{-1}. \tag{A.9}$$

Here, $\varvec{s}_i$ is the $i$th column vector of $\varvec{S}$.
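Equations (A.7)-(A.9) can be sketched in a few lines of numpy. The following helper (our own illustration, not the fanc implementation) assumes $\varvec{\Psi}_{\mathrm{old}}$ is diagonal and passed as the vector of unique variances:

```python
import numpy as np

def e_step_moments(S, Lam_old, psi_old, Phi_old):
    """Compute M, A, and the matrix whose i-th column is b_i, per (A.7)-(A.9).

    psi_old holds the diagonal of Psi_old (the unique variances).
    """
    Psi_inv = np.diag(1.0 / psi_old)
    M = Lam_old.T @ Psi_inv @ Lam_old + np.linalg.inv(Phi_old)  # (A.9)
    M_inv = np.linalg.inv(M)
    W = M_inv @ Lam_old.T @ Psi_inv      # shared factor M^{-1} Lam^T Psi^{-1}
    A = M_inv + W @ S @ W.T              # (A.7)
    B = W @ S                            # (A.8): column i of B equals b_i
    return M, A, B

# Toy inputs: p = 3 variables, m = 2 factors.
S = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.3],
              [0.2, 0.3, 1.0]])
Lam_old = np.array([[0.8, 0.0], [0.7, 0.1], [0.0, 0.9]])
psi_old = np.array([0.36, 0.5, 0.19])
Phi_old = np.eye(2)

M, A, B = e_step_moments(S, Lam_old, psi_old, Phi_old)
print(A.shape, B.shape)  # (2, 2) (2, 3)
```

Computing all $\varvec{b}_i$ at once as $\varvec{B} = \varvec{M}^{-1}\varvec{\Lambda}_{\mathrm{old}}^T\varvec{\Psi}_{\mathrm{old}}^{-1}\varvec{S}$ avoids a per-column loop and reuses the same factor that appears in (A.7).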

In practice, minimization of (A.6) is difficult because the prenet penalty consists of nonconvex functions. Therefore, we use a coordinate descent algorithm to obtain the updated loading matrix $\varvec{\Lambda}_{\mathrm{new}}$. Let $\tilde{\varvec{\lambda}}_{i}^{(j)}$ be the $(m-1)$-dimensional vector $({\tilde{\lambda}}_{i1},{\tilde{\lambda}}_{i2},\dots,{\tilde{\lambda}}_{i(j-1)},{\tilde{\lambda}}_{i(j+1)},\dots,{\tilde{\lambda}}_{im})^T$. The parameter $\lambda_{ij}$ is updated by minimizing (A.6) with the other parameters $\tilde{\varvec{\lambda}}_{i}^{(j)}$ and $\varvec{\Psi}$ fixed; that is, we solve the following problem:

$$\begin{aligned} {\tilde{\lambda}}_{ij} &= \arg \min_{\lambda_{ij}} \frac{1}{2\psi_i} \left\{ a_{jj}\lambda_{ij}^2 - 2\left( b_{ij} - \sum_{k \ne j} a_{kj} {\tilde{\lambda}}_{ik} \right) \lambda_{ij} \right\} + \rho \left[ \left\{ \frac{1}{2}(1-\gamma) \sum_{k \ne j} {\tilde{\lambda}}_{ik}^2 \right\} \lambda_{ij}^2 + \left( \gamma \sum_{k \ne j} |{\tilde{\lambda}}_{ik}| \right) |\lambda_{ij}| \right] \\ &= \arg \min_{\lambda_{ij}} \frac{1}{2\psi_i} \left\{ (a_{jj}+\beta)\lambda_{ij}^2 - 2\left( b_{ij} - \sum_{k \ne j} a_{kj} {\tilde{\lambda}}_{ik} \right) \lambda_{ij} \right\} + \rho \xi |\lambda_{ij}| \\ &= \arg \min_{\lambda_{ij}} \frac{1}{2} \left( \lambda_{ij} - \frac{b_{ij} - \sum_{k \ne j} a_{kj} {\tilde{\lambda}}_{ik}}{a_{jj} + \beta} \right)^2 + \frac{\psi_i \rho \xi}{a_{jj}+\beta} |\lambda_{ij}|, \end{aligned} \tag{A.10}$$

where

$$\beta = \rho \psi_i (1-\gamma) \sum_{k \ne j} {\tilde{\lambda}}_{ik}^2, \qquad \xi = \gamma \sum_{k \ne j} |{\tilde{\lambda}}_{ik}|.$$

This is equivalent to minimizing a penalized squared-error loss function of the following form:

(A.11)
$$
S(\tilde{\theta}) = \arg\min_{\theta}\left\{\frac{1}{2}(\theta - \tilde{\theta})^2 + \rho^*|\theta|\right\},
$$

where $\tilde{\theta} = \bigl(b_{ij} - \sum_{k\ne j} a_{kj}\tilde{\lambda}_{ik}\bigr)/(a_{jj}+\beta)$ and $\rho^* = \psi_i\rho\xi/(a_{jj}+\beta)$.

The solution $S(\tilde{\theta})$ can be expressed in closed form using the soft-thresholding function:

(A.12)
$$
S(\tilde{\theta}) = \mathrm{sgn}(\tilde{\theta})\,(|\tilde{\theta}| - \rho^*)_+,
$$

where $A_+ = \max(A, 0)$.
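As an illustration, the coordinate update (A.10)–(A.12) can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the arguments `a` and `b_i` stand for the matrix $\boldsymbol{A}$ and the $i$th row of $\boldsymbol{B}$ appearing in (A.7).

```python
import numpy as np

def soft_threshold(theta_tilde, rho_star):
    # Soft-thresholding operator of (A.12): sgn(x) * (|x| - rho*)_+
    return np.sign(theta_tilde) * np.maximum(np.abs(theta_tilde) - rho_star, 0.0)

def update_lambda_ij(lam_i, j, a, b_i, psi_i, rho, gamma):
    # One coordinate update of lambda_{ij} following (A.10)-(A.12).
    # lam_i : current i-th row of the loading matrix (length m)
    # a, b_i: the quantities A and b_i of (A.7)
    m = len(lam_i)
    others = [k for k in range(m) if k != j]
    beta = rho * psi_i * (1.0 - gamma) * sum(lam_i[k] ** 2 for k in others)
    xi = gamma * sum(abs(lam_i[k]) for k in others)
    theta_tilde = (b_i[j] - sum(a[k, j] * lam_i[k] for k in others)) / (a[j, j] + beta)
    rho_star = psi_i * rho * xi / (a[j, j] + beta)
    return soft_threshold(theta_tilde, rho_star)
```

With $\rho = 0$ the update reduces to the unpenalized coordinate-wise least-squares step, which is a convenient sanity check.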

For a given $\boldsymbol{\Lambda}_{\mathrm{new}}$, the parameters of unique variances and factor correlations, say $\boldsymbol{\Psi}_{\mathrm{new}}$ and $\boldsymbol{\Phi}_{\mathrm{new}}$, are expressed as

$$
\begin{aligned}
\psi_i^{\mathrm{new}} &= s_{ii} - 2(\boldsymbol{\lambda}_i^{\mathrm{new}})^T\boldsymbol{b}_i + (\boldsymbol{\lambda}_i^{\mathrm{new}})^T\boldsymbol{A}\boldsymbol{\lambda}_i^{\mathrm{new}} \quad (i = 1,\dots,p),\\
\boldsymbol{\Phi}_{\mathrm{new}} &= \mathop{\mathrm{arg\,min}}_{\boldsymbol{\Phi}}\left\{\log|\boldsymbol{\Phi}| + \mathrm{tr}(\boldsymbol{\Phi}^{-1}\boldsymbol{A})\right\},
\end{aligned}
$$

where $\psi_i^{\mathrm{new}}$ is the $i$th diagonal element of $\boldsymbol{\Psi}_{\mathrm{new}}$, and $\boldsymbol{\lambda}_i^{\mathrm{new}}$ is the $i$th row of $\boldsymbol{\Lambda}_{\mathrm{new}}$. The new value $\boldsymbol{\Phi}_{\mathrm{new}}$ cannot, in general, be expressed in explicit form because all of the diagonal elements of $\boldsymbol{\Phi}$ are fixed at one. Thus, $\boldsymbol{\Phi}_{\mathrm{new}}$ is obtained by the Broyden–Fletcher–Goldfarb–Shanno (BFGS) optimization procedure.
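For illustration, a minimal sketch of this BFGS step using `scipy.optimize.minimize` is given below. The function name `update_phi` is a hypothetical placeholder, and positive definiteness along the search path is handled crudely by returning a large penalty value outside the positive-definite cone; a production implementation would use a safer parameterization.

```python
import numpy as np
from scipy.optimize import minimize

def update_phi(A):
    # Minimize log|Phi| + tr(Phi^{-1} A) over correlation matrices Phi
    # (unit diagonal), taking the off-diagonal entries as free parameters.
    m = A.shape[0]
    iu = np.triu_indices(m, k=1)

    def to_corr(x):
        Phi = np.eye(m)
        Phi[iu] = x
        Phi.T[iu] = x  # symmetrize via the transposed view
        return Phi

    def objective(x):
        Phi = to_corr(x)
        sign, logdet = np.linalg.slogdet(Phi)
        if sign <= 0:  # outside the positive-definite cone
            return 1e10
        return logdet + np.trace(np.linalg.solve(Phi, A))

    res = minimize(objective, x0=np.zeros(len(iu[0])), method="BFGS")
    return to_corr(res.x)
```

When $\boldsymbol{A}$ is itself a correlation matrix, the minimizer of $\log|\boldsymbol{\Phi}| + \mathrm{tr}(\boldsymbol{\Phi}^{-1}\boldsymbol{A})$ is $\boldsymbol{\Phi} = \boldsymbol{A}$, which provides a convenient check on the sketch.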

B.2. Algorithm Complexity

We consider the complexity of the proposed algorithm for the orthogonal case. To update $\boldsymbol{\Lambda}$, we need the matrix $\boldsymbol{A}$ in (A.7), whose computation requires $O(p^2m)$ operations (Zhao et al., 2007). In the coordinate descent algorithm, we compute $\tilde{\theta}$ in (A.11) at each step, which requires $O(m)$ operations. For simplicity, we consider the case where the number of cycles of the coordinate descent algorithm is one. We remark that even in this case our algorithm converges to a (local) optimum, because both the EM and coordinate descent algorithms monotonically decrease the objective function at each iteration.
In this case, $O(m)$ operations are needed to update each $\lambda_{ij}$ ($i=1,\dots,p$; $j=1,\dots,m$), and therefore $O(pm^2)$ operations are required to update $\boldsymbol{\Lambda}$ in the coordinate descent algorithm.
As a result, the complexity of updating $\boldsymbol{\Lambda}$ is $O(p^2m) + O(pm^2) = O(p^2m)$ (assuming $p \ge m$, as is usual in factor analysis).

For the update of $\boldsymbol{\Psi}$, we need $O(pm^2)$ operations, because the computation of $(\boldsymbol{\lambda}_i^{\mathrm{new}})^T\boldsymbol{A}\boldsymbol{\lambda}_i^{\mathrm{new}}$ in (A.13) requires $O(m^2)$ operations for each $i=1,\dots,p$. Note that when unpenalized maximum likelihood estimation is conducted, the order $O(pm)$ is achievable (Zhao et al., 2007); however, this order may not be achievable under the prenet penalization.
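To make the operation counts concrete, one full coordinate-descent sweep over $\boldsymbol{\Lambda}$ might look as follows. This is a sketch under the assumption that `A`, `B`, and `psi` hold the quantities of (A.7); the inner update costs $O(m)$ per coordinate, giving $O(pm^2)$ per sweep, while forming `A` beforehand is the dominant $O(p^2m)$ step.

```python
import numpy as np

def sweep_loadings(Lam, A, B, psi, rho, gamma):
    # One cycle of coordinate descent over all p*m loadings.
    # Each coordinate update is O(m), so a full sweep is O(p m^2).
    p, m = Lam.shape
    Lam = Lam.copy()  # keep the caller's matrix unchanged
    for i in range(p):
        for j in range(m):
            others = [k for k in range(m) if k != j]
            beta = rho * psi[i] * (1.0 - gamma) * sum(Lam[i, k] ** 2 for k in others)
            xi = gamma * sum(abs(Lam[i, k]) for k in others)
            denom = A[j, j] + beta
            theta = (B[i, j] - sum(A[k, j] * Lam[i, k] for k in others)) / denom
            thr = psi[i] * rho * xi / denom
            # Soft-thresholding step of (A.12)
            Lam[i, j] = np.sign(theta) * max(abs(theta) - thr, 0.0)
    return Lam
```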

Footnotes

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s11336-022-09868-4.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References


Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Petrov, B. N. & Csaki, F. (Eds.), 2nd International symposium on information theory (pp. 267–281). Budapest: Akademiai Kiado.
Anderson, T. W., & Rubin, H. (1956). Statistical inference in factor analysis. In Proceedings of the third Berkeley symposium on mathematical statistics and probability (Vol. 5).
Asparouhov, T., & Muthén, B. (2009). Exploratory structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 16(3), 397–438.
Bernaards, C. A., & Jennrich, R. I. (2003). Orthomax rotation and perfect simple structure. Psychometrika, 68(4), 585–588.
Bühlmann, P., & van de Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications (1st ed.). Berlin: Springer.
Booth, T., & Hughes, D. J. (2014). Exploratory structural equation modeling of personality data. Assessment, 21(3), 260–271.
Browne, M. W. (2001). An overview of analytic rotation in exploratory factor analysis. Multivariate Behavioral Research, 36(1), 111–150.
Caner, M., & Han, X. (2014). Selecting the correct number of factors in approximate factor models: The large panel case with group bridge estimators. Journal of Business & Economic Statistics, 32(3), 359–374.
Carroll, J. B. (1953). An analytical solution for approximating simple structure in factor analysis. Psychometrika, 18(1), 23–38.
Choi, J., Zou, H., & Oehlert, G. (2011). A penalized maximum likelihood approach to sparse factor analysis. Statistics and Its Interface, 3(4), 429–436.
Cudeck, R., & Henly, S. J. (1991). Model selection in covariance structures analysis and the "problem" of sample size: A clarification. Psychological Bulletin, 109(3), 512–519.
Ding, C. H. Q., He, X., & Simon, H. D. (2005). On the equivalence of nonnegative matrix factorization and spectral clustering. In SDM (Vol. 5, pp. 606–610). SIAM.
Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. Psychological Methods, 4(3), 272–299.
Flynn, C., & Perry, P. (2020). Profile likelihood biclustering. Electronic Journal of Statistics, 14(1), 731–768.
Foucart, S., & Rauhut, H. (2013). A mathematical introduction to compressive sensing. Berlin: Springer.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.
Goldberg, L. R. (1992). The development of markers for the big-five factor structure. Psychological Assessment, 4(1), 26–42.
Hattori, M., Zhang, G., & Preacher, K. J. (2017). Multiple local solutions and geomin rotation. Multivariate Behavioral Research, 52(6), 112.
Hendrickson, A. E., & White, P. O. (1964). Promax: A quick method for rotation to oblique simple structure. British Journal of Statistical Psychology, 17(1), 65–70.
Hirose, K., & Yamamoto, M. (2014). Estimation of an oblique structure via penalized likelihood factor analysis. Computational Statistics & Data Analysis, 79, 120–132.
Hirose, K., & Yamamoto, M. (2015). Sparse estimation via nonconcave penalized likelihood in factor analysis model. Statistics and Computing, 25(5), 863–875.
Huang, P.-H. (2018). A penalized likelihood method for multi-group structural equation modelling. British Journal of Mathematical and Statistical Psychology, 71(3), 499–522.
Huang, P.-H., Chen, H., & Weng, L.-J. (2017). A penalized likelihood method for structural equation modeling. Psychometrika, 82(2), 329–354.
Hui, F. K. C., Tanaka, E., & Warton, D. I. (2018). Order selection and sparsity in latent variable models via the ordered factor lasso. Biometrics, 74(4), 1311–1319.
Jacobucci, R., Grimm, K. J., & McArdle, J. J. (2016). Regularized structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 23(4), 555–566.
Jennrich, R. I. (2004). Rotation to simple loadings using component loss functions: The orthogonal case. Psychometrika, 69(2), 257–273.
Jennrich, R. I. (2006). Rotation to simple loadings using component loss functions: The oblique case. Psychometrika, 71(1), 173–191.
Jennrich, R. I., & Sampson, P. F. (1966). Rotation for simple loadings. Psychometrika, 31(3), 313–323.
Jöreskog, K. G., & Goldberger, A. S. (1971). Factor analysis by generalized least squares. ETS Research Bulletin Series, 1971, i–32.
Kaiser, H. F. (1958). The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23(3), 187–200.
Kaiser, H. F. (1974). An index of factorial simplicity. Psychometrika, 39(1), 31–36.
Kiers, H. A. L. (1994). Simplimax: Oblique rotation to an optimal target with simple structure. Psychometrika, 59(4), 567–579.
Lee, S.-Y. (1981). A Bayesian approach to confirmatory factor analysis. Psychometrika, 46(2), 153–160.
Marsh, H. W., Lüdtke, O., Muthén, B., Asparouhov, T., Morin, A. J. S., Trautwein, U., & Nagengast, B. (2010). A new look at the big five factor structure through exploratory structural equation modeling. Psychological Assessment, 22(3), 471–491.
Marsh, H. W., Nagengast, B., & Morin, A. J. S. (2013). Measurement invariance of big-five factors over the life span: ESEM tests of gender, age, plasticity, maturity, and la dolce vita effects. Developmental Psychology, 49(6), 1194–1218.
Mazumder, R., Friedman, J., & Hastie, T. (2011). Sparsenet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106, 1125–1138.
Neuhaus, J. O., & Wrigley, C. (1954). The quartimax method: An analytical approach to orthogonal simple structure. British Journal of Statistical Psychology, 7(2), 81–91.
Ning, L., & Georgiou, T. T. (2011). Sparse factor analysis via likelihood and $\ell_1$ regularization. In 50th IEEE conference on decision and control and European control conference (pp. 5188–5192).
Pfanzagl, J. (1994). Parametric statistical theory. Berlin, Boston: De Gruyter.
Sass, D. A., & Schmitt, T. A. (2010). A comparative investigation of rotation criteria within exploratory factor analysis. Multivariate Behavioral Research, 45(1), 73–103.
Scharf, F., & Nestler, S. (2019). Should regularization replace simple structure rotation in exploratory factor analysis? Structural Equation Modeling, 26(4), 576–590.
Scharf, F., & Nestler, S. (2019). A comparison of simple structure rotation criteria in temporal exploratory factor analysis for event-related potential data. Methodology, 15, 43–60.
Schmitt, T. A., & Sass, D. A. (2011). Rotation criteria and hypothesis testing for exploratory factor analysis: Implications for factor pattern loadings and interfactor correlations. Educational and Psychological Measurement, 71(1), 95–113.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
Srivastava, S., Engelhardt, B. E., & Dunson, D. B. (2017). Expandable factor analysis. Biometrika, 104(3), 649–663.
Tan, K. M., & Witten, D. M. (2014). Sparse biclustering of transposable data. Journal of Computational and Graphical Statistics, 23(4), 985–1008.
Thurstone, L. L. (1935). The vectors of mind. Chicago: University of Chicago Press.
Thurstone, L. L. (1947). Multiple factor analysis. Chicago: University of Chicago Press.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611–622.
Trendafilov, N. T. (2013). From simple structure to sparse components: A review. Computational Statistics, 29(3–4), 431–454.
Trendafilov, N. T., Fontanella, S., & Adachi, K. (2017). Sparse exploratory factor analysis. Psychometrika, 82(3), 778–794.
Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5), 2183–2202.
Yamamoto, M., & Jennrich, R. I. (2013). A cluster-based factor rotation. British Journal of Mathematical and Statistical Psychology, 66(3), 488–502.
Yates, A. (1987). Multivariate exploratory data analysis: A perspective on exploratory factor analysis. New York: State University of New York Press.
Zhang, C. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38, 894–942.
Zhao, P., & Yu, B. (2007). On model selection consistency of lasso. Journal of Machine Learning Research, 7, 2541–2563.
Zhao, J. H., Yu, P. L. H., & Jiang, Q. (2007). ML estimation for factor analysis: EM or non-EM? Statistics and Computing, 18(2), 109–123.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 301–320.
Zou, H., Hastie, T., & Tibshirani, R. (2007). On the degrees of freedom of the lasso. The Annals of Statistics, 35, 2173–2192.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Petrov, B. N. & Csaki, F. (Eds.), 2nd International symposium on information theory (pp. 267281). Budapest: Akademiai Kiado.Google Scholar
Anderson, T. W., & Rubin, H. (1956). Statistical inference in factor analysis. In Proceedings of the third Berkeley symposium on mathematical statistics and probability (Vol. 5).Google Scholar
Asparouhov, T., & Muthén, B. (2009). Exploratory structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 16(3), 397438.CrossRefGoogle Scholar
Bernaards, C. A., & Jennrich, R. I. (2003). Orthomax rotation and perfect simple structure. Psychometrika, 68(4), 585588.CrossRefGoogle Scholar
Bhlmann, P., & van de Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications (1st ed.). Berlin: Springer.CrossRefGoogle Scholar
Booth, T., & Hughes, D. J. (2014). Exploratory structural equation modeling of personality data. Assessment, 21(3), 260271.CrossRefGoogle ScholarPubMed
Browne, M. W. (2001). An overview of analytic rotation in exploratory factor analysis. Multivariate Behavioral Research, 36(1), 111150.CrossRefGoogle Scholar
Caner, M., & Han, X. (2014). Selecting the correct number of factors in approximate factor models: The large panel case with group bridge estimators. Journal of Business & Economic Statistics, 32(3), 359374.CrossRefGoogle Scholar
Carroll, J. B. (1953). An analytical solution for approximating simple structure in factor analysis. Psychometrika, 18(1), 2338.CrossRefGoogle Scholar
Choi, J., Zou, H., & Oehlert, G. (2011). A penalized maximum likelihood approach to sparse factor analysis. Statistics and Its Interface, 3(4), 429436.CrossRefGoogle Scholar
Cudeck, R., & Henly, S. J. (1991). Model selection in covariance structures analysis and the" problem" of sample size: A clarification. Psychological Bulletin, 109(3), 512.CrossRefGoogle ScholarPubMed
Ding, C. H. Q., He, X., & Simon, H. D. (2005). On the equivalence of nonnegative matrix factorization and spectral clustering. In SDM (Vol. 5, pp. 606610). SIAM.Google Scholar
Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. Psychological Methods, 4(3), 272299.CrossRefGoogle Scholar
Flynn, C., & Perry, P. (2020). Profile likelihood biclustering. Electronic Journal of Statistics, 14(1), 731768.CrossRefGoogle Scholar
Foucart, S., & Rauhut, H. (2013). A mathematical introduction to compressive sensing. Berlin: Springer.CrossRefGoogle Scholar
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 122.CrossRefGoogle ScholarPubMed
Goldberg, L. R. (1992). The development of markers for the big-five factor structure. Psychological assessment, 4(1), 26.CrossRefGoogle Scholar
Hattori, M., Zhang, G., & Preacher, K. J. (2017). Multiple local solutions and geomin rotation. Multivariate Behavioral Research, 52(6), 112.CrossRefGoogle ScholarPubMed
Hendrickson, A. E., & White, P. O. (1964). Promax: A quick method for rotation to oblique simple structure. British Journal of Statistical Psychology, 17(1), 6570.CrossRefGoogle Scholar
Hirose, K., & Yamamoto, M. (2014). Estimation of an oblique structure via penalized likelihood factor analysis. Computational Statistics & Data Analysis, 79, 120132.CrossRefGoogle Scholar
Hirose, K., & Yamamoto, M. (2015). Sparse estimation via nonconcave penalized likelihood in factor analysis model. Statistics and Computing, 25(5), 863875.CrossRefGoogle Scholar
Huang, P.-H. (2018). A penalized likelihood method for multi-group structural equation modelling. British Journal of Mathematical and Statistical Psychology, 71(3), 499522.CrossRefGoogle ScholarPubMed
Huang, P.-H., Chen, H., & Weng, L.-J. (2017). A penalized likelihood method for structural equation modeling. Psychometrika, 82(2), 329354.CrossRefGoogle ScholarPubMed
Hui, F. K. C., Tanaka, E., & Warton, D. I. (2018). Order selection and sparsity in latent variable models via the ordered factor lasso. Biometrics, 74(4), 13111319.CrossRefGoogle ScholarPubMed
Jacobucci, R., Grimm, K. J., & McArdle, J. J. (2016). Regularized structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 23(4), 555566.CrossRefGoogle ScholarPubMed
Jennrich, R. I. (2004). Rotation to simple loadings using component loss functions: The orthogonal case. Psychometrika, 69(2), 257273.CrossRefGoogle Scholar
Jennrich, R. I. (2006). Rotation to simple loadings using component loss functions: The oblique case. Psychometrika, 71(1), 173191.CrossRefGoogle Scholar
Jennrich, R. I., & Sampson, P. F. (1966). Rotation for simple loadings. Psychometrika, 31(3), 313313.CrossRefGoogle ScholarPubMed
Jöreskog, K. G., & Goldberger, A. S. (1971). Factor analysis by generalized least squaresd. ETS Research Bulletin Series, 1971 i–32.CrossRefGoogle Scholar
Kaiser, H. F. (1958). The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23(3), 187200.CrossRefGoogle Scholar
Kaiser, H. F. (1974). An index of factorial simplicity. Psychometrika, 39(1), 3136.CrossRefGoogle Scholar
Kiers, H. A. L. (1994). Simplimax: Oblique rotation to an optimal target with simple structure. Psychometrika, 59(4), 567579.CrossRefGoogle Scholar
Lee, S.-Y. (1981). A Bayesian approach to confirmatory factor analysis. Psychometrika, 46(2), 153160.CrossRefGoogle Scholar
Marsh, H. W., Lüdtke, O., Muthén, B., Asparouhov, T., Morin, A. J. S., Trautwein, U., & Nagengast, B. (2010). A new look at the big five factor structure through exploratory structural equation modeling. Psychological Assessment, 22(3), 471491.CrossRefGoogle Scholar
Marsh, H. W., Nagengast, B., & Morin, A. J. S. (2013). Measurement invariance of big-five factors over the life span: ESEM tests of gender, age, plasticity, maturity, and la dolce vita effects. Developmental Psychology, 49(6), 11941218.CrossRefGoogle ScholarPubMed
Mazumder, R., Friedman, J., & Hastie, T. (2011). Sparsenet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106, 11251138.CrossRefGoogle ScholarPubMed
Neuhaus, J. O., & Wrigley, C. (1954). The quartimax method: An analytical approach to orthogonal simple structure. British Journal of Statistical Psychology, 7(2), 8191.CrossRefGoogle Scholar
Ning, L., & Georgiou, T. T. (2011). Sparse factor analysis via likelihood and 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ell _1$$\end{document} regularization. In 50th IEEE conference on decision and control and European control conference (pp. 5188–5192).Google Scholar
Pfanzagl, J. (1994). Parametric statistical theory. Berlin, Boston: De Gruyter.CrossRefGoogle Scholar
Sass, D. A., & Schmitt, T. A. (2010). A comparative investigation of rotation criteria within exploratory factor analysis. Multivariate Behavioral Research, 45(1), 73103.CrossRefGoogle ScholarPubMed
Scharf, F., & Nestler, S. (2019). Should regularization replace simple structure rotation in exploratory factor analysis?. Structural Equation Modeling, 26(4), 576590.CrossRefGoogle Scholar
Scharf, F., & Nestler, S. (2019). A comparison of simple structure rotation criteria in temporal exploratory factor analysis for event-related potential data. Methodology, 15, 4360.CrossRefGoogle Scholar
Schmitt, T. A., & Sass, D. A. (2011). Rotation criteria and hypothesis testing for exploratory factor analysis: Implications for factor pattern loadings and interfactor correlations. Educational and Psychological Measurement, 71(1), 95113.CrossRefGoogle Scholar
Schwarz, G. (1978). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9, 11351151.Google Scholar
Srivastava, S., Engelhardt, B. E., & Dunson, D. B. (2017). Expandable factor analysis. Biometrika, 104(3), 649663.CrossRefGoogle ScholarPubMed
Tan, K. M., & Witten, D. M. (2014). Sparse biclustering of transposable data. Journal of Computational and Graphical Statistics, 23(4), 9851008.CrossRefGoogle ScholarPubMed
Thurstone, L. L. (1935). The vectors of mind. The vectors of mind. Chicago: University of Chicago Press.Google Scholar
Thurstone, L. L. (1947). Multiple factor analysis. Chicago: University of Chicago Press.Google Scholar
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267288.CrossRefGoogle Scholar
Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611622.CrossRefGoogle Scholar
Trendafilov, N. T. (2013). From simple structure to sparse components: A review. Computational Statistics, 29(3–4), 431454.CrossRefGoogle Scholar
Trendafilov, N. T., Fontanella, S., & Adachi, K. (2017). Sparse exploratory factor analysis. Psychometrika, 82(3), 778794.CrossRefGoogle Scholar
Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5), 21832202.CrossRefGoogle Scholar
Yamamoto, M., & Jennrich, R. I. (2013). A cluster-based factor rotation. British Journal of Mathematical and Statistical Psychology, 66(3), 488–502.
Yates, A. (1987). Multivariate exploratory data analysis: A perspective on exploratory factor analysis. New York: State University of New York Press.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38, 894–942.
Zhao, P., & Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7, 2541–2563.
Zhao, J. H., Yu, P. L. H., & Jiang, Q. (2007). ML estimation for factor analysis: EM or non-EM? Statistics and Computing, 18(2), 109–123.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 301–320.
Zou, H., Hastie, T., & Tibshirani, R. (2007). On the degrees of freedom of the lasso. The Annals of Statistics, 35, 2173–2192.
Figure 1. Penalty functions of the prenet and the elastic net with various γ.

Figure 2. RMSE of factor loadings. The upper and lower bars represent the 95th and 5th percentiles, respectively. Here, "ρ → +0" denotes the limit of the estimated factor loadings, limρ→+0 Λ̂ρ, which corresponds to the factor rotation.

Figure 3. Rate of nonzero factor loadings. The upper and lower bars represent the 95th and 5th percentiles, respectively. Here, "ρ → +0" denotes the limit of the estimated factor loadings, limρ→+0 Λ̂ρ, which corresponds to the factor rotation.

Figure 4. Adjusted Rand Index (ARI) of the clustering results.

Figure 5. Heatmaps of the loading matrices for the big five personality traits data. Each cell corresponds to a factor loading, and the color intensity indicates the magnitude of the loading.

Table 1. Factor loadings of four items estimated by the prenet penalization with γ = 0.01. The regularization parameter ρ is selected by the BIC. Cross-loadings whose absolute values are larger than 0.3 are shown in bold.

Table 2. The number of times that the absolute values of four cross-loadings exceed 0.3. For the regularization methods, ρ is selected by the BIC.

Figure 6. RMSE and rate of nonzero loadings when n = 100 and 500. Here, "ρ → +0" denotes the limit of the estimated factor loadings, limρ→+0 Λ̂ρ, which corresponds to the factor rotation.

Figure 7. Heatmaps of the loading matrices for the big five personality traits data for various values of the tuning parameters of the MCP and the prenet penalization.

Supplementary material
Hirose and Terada supplementary material S1–S3 (File, 1.9 MB)
Hirose and Terada supplementary material 1 (File, 117.5 KB)
Hirose and Terada supplementary material 2 (File, 116.1 KB)
Hirose and Terada supplementary material 3 (File, 113.8 KB)
Hirose and Terada supplementary material 4 (File, 119.2 KB)
Hirose and Terada supplementary material 5 (File, 115.8 KB)
Hirose and Terada supplementary material 6 (File, 114 KB)
Hirose and Terada supplementary material 7 (File, 129.4 KB)
Hirose and Terada supplementary material 8 (File, 129.6 KB)
Hirose and Terada supplementary material 9 (File, 128.3 KB)
Hirose and Terada supplementary material 10 (File, 31.9 KB)
Hirose and Terada supplementary material 11 (File, 31.9 KB)
Hirose and Terada supplementary material 12 (File, 30.8 KB)