Regularized Latent Class Analysis with Application in Cognitive Diagnosis

Abstract

Diagnostic classification models are confirmatory in the sense that the relationship between the latent attributes and responses to items is specified or parameterized. Such models are readily interpretable, with each component of the model usually having a practical meaning. However, parameterized diagnostic classification models are sometimes too simple to capture all the data patterns, resulting in significant model lack of fit. In this paper, we attempt to obtain a compromise between interpretability and goodness of fit by regularizing a latent class model. Our approach starts with minimal assumptions on the data structure, followed by suitable regularization to reduce complexity, so that a readily interpretable, yet flexible, model is obtained. An expectation–maximization-type algorithm is developed for efficient computation. It is shown that the proposed approach enjoys good theoretical properties. Results from simulation studies and a real application are presented.

References

  • Allman, E. S., Matias, C., & Rhodes, J. A. (2009). Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37, 3099–3132.

  • American Psychiatric Association. (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: American Psychiatric Association.

  • Chen, J., & Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95, 759–771.

  • Chen, Y., Liu, J., Xu, G., & Ying, Z. (2015a). Statistical analysis of Q-matrix based diagnostic classification models. Journal of the American Statistical Association, 110, 850–866.

  • Chen, Y., Liu, J., & Ying, Z. (2015b). Online item calibration for Q-matrix in CD-CAT. Applied Psychological Measurement, 39, 5–15.

  • Croon, M. (1990). Latent class analysis with ordered latent classes. British Journal of Mathematical and Statistical Psychology, 43, 171–192.

  • Croon, M. (1991). Investigating mokken scalability of dichotomous items by means of ordinal latent class analysis. British Journal of Mathematical and Statistical Psychology, 44, 315–331.

  • Dalrymple, K., & D’Avanzato, C. (2013). Differentiating the subtypes of social anxiety disorder. Expert Review of Neurotherapeutics, 13, 1271–1283.

  • de la Torre, J. (2011). The generalized DINA model framework. Psychometrika, 76, 179–199.

  • de la Torre, J., & Douglas, J. (2004). Higher order latent trait models for cognitive diagnosis. Psychometrika, 69, 333–353.

  • DiBello, L. V., Stout, W. F., & Roussos, L. A. (1995). Unified cognitive/psychometric diagnostic assessment likelihood-based classification techniques. In R. L. B. Paul, D. Nichols, & Susan F. Chipman (Eds.), Cognitively diagnostic assessment (pp. 361–389). Hillsdale, NJ: Erlbaum.

  • Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.

  • Fan, Y., & Tang, C. Y. (2013). Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75, 531–552.

  • Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1–22.

  • Goodman, L. A. (1974a). The analysis of systems of qualitative variables when some of the variables are unobservable. Part I—a modified latent structure approach. American Journal of Sociology, 79, 1179–1259.

  • Goodman, L. A. (1974b). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215–231.

  • Grant, B. F., Kaplan, K., Shepard, J., & Moore, T. (2003). Source and accuracy statement for Wave 1 of the 2001–2002 National Epidemiologic Survey on Alcohol and Related Conditions. Bethesda, MD: National Institute on Alcohol Abuse and Alcoholism.

  • Haberman, S. J., von Davier, M., & Lee, Y.-H. (2008). Comparison of multidimensional item response models: Multivariate normal ability distributions versus multivariate polytomous ability distributions (ETS Research Rep. No. RR-08-45). Princeton, NJ: ETS.

  • Haertel, E. H. (1989). Using restricted latent class models to map the skill structure of achievement items. Journal of Educational Measurement, 26, 301–321.

  • Henson, R. A., Templin, J. L., & Willse, J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74, 191–210.

  • Junker, B., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258–272.

  • Kessler, R. C., Berglund, P., Demler, O., Jin, R., Merikangas, K. R., & Walters, E. E. (2005). Lifetime prevalence and age-of-onset distributions of DSM-IV disorders in the national comorbidity survey replication. Archives of General Psychiatry, 62, 593–602.

  • Lazarsfeld, P. F., Henry, N. W., & Anderson, T. W. (1968). Latent structure analysis. Boston, MA: Houghton Mifflin.

  • Lehmann, E. L., & Casella, G. (2006). Theory of point estimation. New York: Springer.

  • Leighton, J., & Gierl, M. (2007). Cognitive diagnostic assessment for education: Theory and applications. Cambridge: Cambridge University Press.

  • Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The attribute hierarchy model for cognitive assessment: A variation on Tatsuoka’s rule-space approach. Journal of Educational Measurement, 41, 205–237.

  • Li, X., Liu, J., & Ying, Z. (2016). Chernoff index for Cox test of separate parametric families. arXiv:1606.08248.

  • Liu, J., Xu, G., & Ying, Z. (2012). Data-driven learning of Q-matrix. Applied Psychological Measurement, 36, 548–564.

  • Liu, J., Xu, G., & Ying, Z. (2013). Theory of self-learning Q-matrix. Bernoulli, 19, 1790–1817.

  • Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. The Annals of Statistics, 12, 758–765.

  • Rupp, A., & Templin, J. (2008). Unique characteristics of diagnostic classification models: A comprehensive review of the current state-of-the-art. Measurement: Interdisciplinary Research and Perspective, 6, 219–262.

  • Rupp, A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. New York: Guilford Press.

  • Stein, M. B., & Stein, D. J. (2008). Social anxiety disorder. The Lancet, 371, 1115–1125.

  • Tatsuoka, C. (2002). Data analytic methods for latent partially ordered classification models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 51, 337–350.

  • Tatsuoka, C., & Ferguson, T. (2003). Sequential classification on partially ordered sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65, 143–157.

  • Tatsuoka, K. (1985). A probabilistic model for diagnosing misconceptions in the pattern classification approach. Journal of Educational Statistics, 12, 55–73.

  • Tatsuoka, K. (2009). Cognitive assessment: An introduction to the rule space method. New York: Routledge.

  • Templin, J., & Henson, R. (2006). Measurement of psychological disorders using cognitive diagnosis models. Psychological Methods, 11, 287–305.

  • Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58, 267–288.

  • von Davier, M. (2005). A general diagnostic model applied to language testing data (ETS Research Rep. No. RR-05-16). Princeton, NJ: ETS.

  • von Davier, M. (2008). A general diagnostic model applied to language testing data. British Journal of Mathematical and Statistical Psychology, 61, 287–307.

  • von Davier, M. (2014). The DINA model as a constrained general diagnostic model: Two variants of a model equivalency. British Journal of Mathematical and Statistical Psychology, 67, 49–71.

  • von Davier, M., & Haberman, S. J. (2014). Hierarchical diagnostic classification models morphing into unidimensional ‘diagnostic’ classification models—a commentary. Psychometrika, 79, 340–346.

  • von Davier, M., & Yamamoto, K. (2004). A class of models for cognitive diagnosis. Paper presented at the 4th Spearman Conference, Philadelphia, PA.

  • Wang, H., Li, B., & Leng, C. (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71, 671–683.

  • Wang, H., Li, R., & Tsai, C.-L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94, 553–568.

  • Wang, T., & Zhu, L. (2011). Consistent tuning parameter selection in high dimensional sparse linear regression. Journal of Multivariate Analysis, 102, 1141–1151.

  • Xu, G. (2016). Identifiability of restricted latent class models with binary responses. arXiv:1603.04140.

Acknowledgments

This work was supported by NSF (grant nos. SES-1323977, IIS-1633360), Army Research Office (grant no. W911NF-15-1-0159), and NIH (grant no. R01GM047845).

Author information

Corresponding author

Correspondence to Jingchen Liu.

Appendices

Appendix 1: Estimation Via the Expectation–Maximization Algorithm

We propose to use the expectation-maximization (EM) algorithm combined with the coordinate descent algorithm for the computation of the regularized estimator in (10) for given \(\lambda \) and M. The algorithm guarantees a monotone increasing objective function. Given initial values \(\mathbf {c}\) and \(\varvec{\pi }\), the algorithm iterates between the E-step and M-step until convergence.
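
To fix ideas, the following self-contained Python sketch implements this E-step/M-step alternation for the unregularized latent class model (the special case \(\kappa _{\lambda }\equiv 0\), in which the update of \(\mathbf {c}\) also has a closed form). The function name, initialization, and stopping rule are illustrative assumptions rather than the authors' implementation; the regularized algorithm keeps the same E-step and \(\varvec{\pi }\)-update but replaces the closed-form update of \(\mathbf {c}\) with the penalized maximization described below.

```python
import numpy as np

def lca_em(R, M, max_iter=500, tol=1e-8, seed=0):
    """EM for the plain latent class model (penalty set to zero).

    R : (N, J) binary response matrix; M : number of latent classes.
    Returns item parameters c (J, M), class proportions pi (M,), and the
    final posterior weights q (N, M).
    """
    rng = np.random.default_rng(seed)
    N, J = R.shape
    c = rng.uniform(0.2, 0.8, size=(J, M))    # item response probabilities c_{j,m}
    pi = np.full(M, 1.0 / M)                  # class proportions pi_m
    prev = -np.inf
    for _ in range(max_iter):
        # E-step: q[i, m] = P(m_i = m | R_i), computed in log space for stability
        logw = R @ np.log(c) + (1 - R) @ np.log(1 - c) + np.log(pi)
        logw -= logw.max(axis=1, keepdims=True)
        q = np.exp(logw)
        q /= q.sum(axis=1, keepdims=True)
        # M-step without penalty: closed-form updates for pi and c
        pi = q.sum(axis=0) / N
        c = np.clip((R.T @ q) / np.maximum(q.sum(axis=0), 1e-12), 1e-6, 1 - 1e-6)
        # observed-data log-likelihood, used to monitor the monotone increase
        loglik = np.sum(np.log(np.exp(R @ np.log(c) + (1 - R) @ np.log(1 - c)) @ pi))
        if loglik - prev < tol:
            break
        prev = loglik
    return c, pi, q
```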

1.1 E-step

In the E-Step, one computes the Q-function,

$$\begin{aligned} Q(\mathbf {c}^*, \varvec{\pi }^* | \mathbf {c}, \varvec{\pi })=E_{\mathbf {c},\varvec{\pi }}\{\log L( \mathbf {c}^*, \varvec{\pi }^*; \mathbf R_i, m_i,i=1,\ldots ,N)\mid \mathbf R_i,i=1,\ldots ,N\}. \end{aligned}$$
(14)

The expectation is taken with respect to the conditional distribution of \(m_i, i=1,\ldots ,N\), given the observed responses; the notation \(E_{\mathbf {c},\varvec{\pi }}\) indicates that this conditional distribution is computed under the parameters \(\mathbf {c}\) and \(\varvec{\pi }\). The complete data log-likelihood function is

$$\begin{aligned} \log L(\mathbf {c}^*, \varvec{\pi }^*;\mathbf R_i,m_i,i=1,\ldots ,N)=\sum _{i=1}^N\sum _{j=1}^J \big [R^{j}_i \log c^*_{j,m_i}+(1-R^j_i)\log (1-c^*_{j,m_i})\big ] + \sum _{i=1}^N \log (\pi ^*_{m_i}). \end{aligned}$$

Under the posterior distribution, \(m_i\), \(i=1,\ldots ,N\) are independent and the posterior distribution associated with the parameters \(\mathbf {c}\) and \(\varvec{\pi }\) is

$$\begin{aligned} \begin{aligned} q_{im} :=&\; P_{\mathbf {c},\varvec{\pi }}(m_i = m\vert \mathbf R_l, l = 1,\ldots ,N)\\ =&\; P_{\mathbf {c},\varvec{\pi }}(m_i = m\vert \mathbf R_i)\\ \varpropto&\;\prod _{j=1}^J c_{j,m}^{R^j_i} (1-c_{j,m})^{1-R^j_i}\pi _m. \end{aligned} \end{aligned}$$

The Q-function takes the following additive form,

$$\begin{aligned} \begin{aligned}&Q(\mathbf {c}^*, \varvec{\pi }^* \vert \mathbf {c}, \varvec{\pi })\\ =&\sum _{j=1}^J\sum _{i=1}^N \sum _{m=1}^M q_{im}\big [R^{j}_i \log c^*_{j,m}+(1-R^j_i)\log (1-c^*_{j,m})\big ] + \sum _{m=1}^M\sum _{i=1}^N q_{im} \log \pi ^*_m. \end{aligned} \end{aligned}$$
(15)
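
As a small illustration of the additive form (15), the hypothetical helper below evaluates the Q-function at candidate parameters and returns the item-specific pieces separately, which makes explicit that the maximization splits into J independent problems for the \(\mathbf {c}^*_j\) plus one problem for \(\varvec{\pi }^*\); the function and argument names are assumptions of this sketch.

```python
import numpy as np

def q_function(R, q, c_star, pi_star):
    """Evaluate the Q-function (15) at (c_star, pi_star) given E-step weights q.

    R       : (N, J) binary responses R_i^j
    q       : (N, M) posterior weights q_{im}
    c_star  : (J, M) candidate item parameters c*_{j,m}
    pi_star : (M,)   candidate class proportions pi*_m
    """
    a = R.T @ q                       # expected correct counts per item and class
    b = (1 - R).T @ q                 # expected incorrect counts per item and class
    item_terms = np.sum(a * np.log(c_star) + b * np.log(1 - c_star), axis=1)  # one term per item j
    pi_term = np.sum(q.sum(axis=0) * np.log(pi_star))
    return item_terms.sum() + pi_term, item_terms, pi_term
```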

1.2 M-step

The M-step consists of maximizing the regularized Q-function with respect to \((\mathbf {c}^*, \varvec{\pi }^*)\)

$$\begin{aligned} \max _{\mathbf {c}^*,\varvec{\pi }^*}Q(\mathbf {c}^*, \varvec{\pi }^* \mid \mathbf {c}, \varvec{\pi })-N\kappa _{\lambda }(\mathbf {c}^*). \end{aligned}$$

Note that in the objective function, the term

$$\begin{aligned} \sum _{m=1}^M\sum _{i=1}^N q_{im} \log \pi ^*_m \end{aligned}$$

depends only on \(\varvec{\pi }^*\), and for each j the term

$$\begin{aligned} \sum _{i=1}^N\sum _{m=1}^{M} q_{im}\Big (R^{j}_i \log c^*_{j,m}+(1-R^j_i)\log (1-c^*_{j,m})\Big )-N \sum _{m=1}^{M-1}p_\lambda ^{SCAD}\big (c^*_{j,(m+1)}-c^*_{j,(m)}\big ) \end{aligned}$$

depends only on \(\mathbf {c}_{j}^*\). Therefore, we can maximize the penalized Q-function with respect to \(\varvec{\pi }^*\) and each \(\mathbf {c}_j^*\) separately. In particular,

$$\begin{aligned} \varvec{\pi }^\dagger = \arg \max _{\varvec{\pi }^*}\sum _{m=1}^M\sum _{i=1}^N q_{im} \log \pi ^*_m \end{aligned}$$

can be computed as follows:

$$\begin{aligned} \pi ^{\dagger }_m = \frac{\sum _{i=1}^N q_{im}}{\sum _{l=1}^M\sum _{i=1}^Nq_{il}}, \ \ m = 1, \ldots , M. \end{aligned}$$

We maximize

$$\begin{aligned} Q_j(\mathbf {c}^*_{j})=\sum _{m=1}^{M} \big [a^j_m\log c^*_{j,m}+b^j_m\log (1-c^*_{j,m})\big ]-N\sum _{m=1}^{M-1}p_\lambda ^{SCAD}\big (c^*_{j,(m+1)}-c^*_{j,(m)}\big ), \end{aligned}$$
(16)

where \(a^j_m=\sum _{i=1}^N q_{im}R_i^j\) and \(b^j_m=\sum _{i=1}^N q_{im}(1-R^j_i)\). Here, \(a^j_m\) represents the expected number of respondents who are from latent class m and have responded correctly to item j, and \(b^j_m\) represents the expected number of respondents who are from latent class m and have responded incorrectly to item j, given the responses and the current parameter estimates.
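
In code, these closed-form quantities are immediate. A minimal sketch (names illustrative), assuming R is the \(N\times J\) binary response matrix and q the \(N\times M\) matrix of posterior weights \(q_{im}\) from the E-step:

```python
import numpy as np

def m_step_closed_form(R, q):
    """Closed-form pieces of the M-step given the E-step weights q.

    Returns pi_dagger (M,) and arrays a, b of shape (J, M), whose entries
    a[j, m] and b[j, m] are the expected numbers of correct and incorrect
    responses to item j from latent class m.
    """
    pi_dagger = q.sum(axis=0) / q.sum()   # equals q.sum(axis=0) / N, since each row of q sums to 1
    a = R.T @ q                           # a^j_m = sum_i q_{im} R_i^j
    b = (1 - R).T @ q                     # b^j_m = sum_i q_{im} (1 - R_i^j)
    return pi_dagger, a, b
```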

Let

$$\begin{aligned} \mathbf {c}^{\dagger }_j=\arg \max _{\mathbf {c}_j}Q_j(\mathbf {c}_j). \end{aligned}$$
(17)

We first establish the following result on the order of \(c^\dagger _{j,m}\), \(m=1,\ldots ,M\).

Proposition 1

Let \(x^*_{j,m}=\frac{a^j_m}{a^j_m+b^j_m}\) and \(c^\dagger _{j,m}\) be defined in (17), \(j=1,\ldots ,J\), \(m=1,\ldots ,M\). Then for each j, the order of \(c^\dagger _{j,1},\ldots ,c^\dagger _{j,M}\) is the same as that of \(x^*_{j,1},\ldots ,x^*_{j,M}\). That is, for \(l\ne s, 1\le l,s\le M\), if \(x^*_{j,l}\ge x^*_{j,s}\) then \(c^\dagger _{j,l}\ge c^\dagger _{j,s}\).

This proposition greatly simplifies the computation in (17). That is, instead of searching for the solution over the whole domain \([0,1]^M\), we only need to consider a much smaller subspace (whose volume is \(1/M!\)) determined by the order of \(x^*_{j,1}, \ldots , x^*_{j,M}\). Once the order of \(c^\dagger _{j,1},\ldots ,c^\dagger _{j,M}\) is known, we parameterize the maximization problem by the order statistics. For instance, if \(x^*_{j,1}<\cdots <x^*_{j,M}\), then \(c^\dagger _{j,(m)}=c^\dagger _{j,m}\). In this case, we write

$$\begin{aligned} Q^r_j(c^*_{j,(1)},d_1,\ldots ,d_{M-1})= & {} \sum _{m=1}^{M}\left[ a^j_m\log \left( c^*_{j,(1)}+\sum _{l=1}^{m-1} d_l\right) +b^j_m\log \left( 1-c^*_{j,(1)}-\sum _{l=1}^{m-1} d_l\right) \right] \\&~~~-N\sum _{m=1}^{M-1}p_\lambda ^{SCAD}(d_m), \end{aligned}$$

where \(d_l = c_{j,(l+1)}^* - c_{j,(l)}^*\). Then we apply the coordinate descent algorithm to the reparametrized function \(Q^r_j(c^*_{j,(1)},d_1,\ldots ,d_{M-1})\) subject to the constraints \(c^*_{j,(1)},d_1,\ldots ,d_{M-1}\ge 0\) and \(c^*_{j,(1)}+\sum _{m=1}^{M-1}d_m\le 1\). For more details about the coordinate descent algorithm, see Friedman, Hastie, and Tibshirani (2010).
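
The following sketch illustrates one way to carry out this reparametrized coordinate descent for a single item. It is a minimal illustration rather than the authors' implementation: the function name, the use of a generic bounded one-dimensional search (scipy.optimize.minimize_scalar) for each coordinate update, and the convention that the penalty is supplied as a callable pen (standing for \(N\,p_{\lambda }^{SCAD}\) evaluated at a gap) are all assumptions of the sketch.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def m_step_item(a, b, pen, n_sweeps=50, eps=1e-6):
    """Maximize (16) for one item by coordinate descent on Q_j^r.

    Following Proposition 1, the classes are sorted by x*_{j,m} = a_m/(a_m+b_m);
    the optimization is over the smallest value and the nonnegative gaps, and
    the result is mapped back to the original class labels.
    a, b : length-M arrays of expected correct / incorrect counts per class
    pen  : callable returning the penalty N * p_lambda^SCAD(d) for a gap d >= 0
    """
    M = len(a)
    ratio = a / np.maximum(a + b, eps)                  # x*_{j,m}
    order = np.argsort(ratio)
    a_s, b_s = a[order], b[order]
    sorted_ratio = np.clip(ratio[order], eps, 1 - eps)
    base = float(sorted_ratio[0])                       # c_{j,(1)}
    d = np.maximum(np.diff(sorted_ratio), 0.0)          # initial gaps d_1, ..., d_{M-1}

    def neg_Qr(base, d):
        c = np.clip(base + np.concatenate(([0.0], np.cumsum(d))), eps, 1 - eps)
        fit = np.sum(a_s * np.log(c) + b_s * np.log(1 - c))
        return -(fit - sum(pen(dl) for dl in d))

    for _ in range(n_sweeps):
        # update the smallest value, keeping base + sum(d) <= 1
        hi = max(1 - d.sum() - eps, 2 * eps)
        base = minimize_scalar(lambda t: neg_Qr(t, d), bounds=(eps, hi),
                               method="bounded").x
        # update each gap d_l in turn, holding the others fixed
        for l in range(M - 1):
            hi_l = max(1 - base - (d.sum() - d[l]) - eps, 1e-12)
            def obj(t, l=l):
                dd = d.copy(); dd[l] = t
                return neg_Qr(base, dd)
            d[l] = minimize_scalar(obj, bounds=(0.0, hi_l), method="bounded").x
    c = np.empty(M)
    c[order] = base + np.concatenate(([0.0], np.cumsum(d)))
    return c
```

With the penalty switched off (pen = lambda d: 0.0), the routine returns, up to the tolerance of the one-dimensional searches, the unpenalized values \(x^*_{j,m}\).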

Appendix 2: Proof of Proposition 1

Proof

For simplicity of notation, we assume \(M=2\) and \(x^*_{j,1}\le x^*_{j,2}\). For \(M>2\), the proof is similar. Assume to the contrary that \(c^\dagger _{j,1}>c^\dagger _{j,2}\). Then according to (17)

$$\begin{aligned} Q_j\big (c^\dagger _{j,1},c^\dagger _{j,2}\big )\ge Q_j\big (c^\dagger _{j,1},c^\dagger _{j,1}\big ) \text{ and } Q_j\big (c^\dagger _{j,1},c^\dagger _{j,2}\big )\ge Q_j\big (c^\dagger _{j,2},c^\dagger _{j,2}\big ). \end{aligned}$$

According to (16), this can be simplified to

$$\begin{aligned} a^j_2\log c^\dagger _{j,2}+b^{j}_2\log \big (1-c^\dagger _{j,2}\big )-Np_{\lambda }^{SCAD}\big (c^\dagger _{j,1}-c^\dagger _{j,2}\big )\ge & {} a^j_2\log c^\dagger _{j,1}+b^{j}_2\log \big (1-c^\dagger _{j,1}\big ) \nonumber \\ \end{aligned}$$
(18)
$$\begin{aligned} a^j_1\log c^\dagger _{j,1}+b^{j}_1\log \big (1-c^\dagger _{j,1}\big )-Np_{\lambda }^{SCAD}\big (c^\dagger _{j,1}-c^\dagger _{j,2}\big )\ge & {} a^j_1\log c^\dagger _{j,2}+b^{j}_1\log \big (1-c^\dagger _{j,2}\big )\nonumber \\ \end{aligned}$$
(19)

Because \(p_{\lambda }^{SCAD}(c^\dagger _{j,1}-c^\dagger _{j,2})\ge 0\), (18) and (19) remain true after removing the term \(-Np_{\lambda }^{SCAD}(c^\dagger _{j,1}-c^\dagger _{j,2})\). Dividing the two inequalities by \(a^j_2+b^j_2\) and \(a^j_1+b^j_1\), respectively, and using the definition of \(x^*_{j,1}\) and \(x^*_{j,2}\), we have

$$\begin{aligned} x^*_{j,2}\log c^\dagger _{j,2}+(1-x^*_{j,2})\log \big (1- c^\dagger _{j,2}\big )\ge & {} x^*_{j,2}\log c^\dagger _{j,1}+(1-x^*_{j,2})\log \big (1- c^\dagger _{j,1}\big )\\ x^*_{j,1}\log c^\dagger _{j,1}+(1-x^*_{j,1})\log \big (1- c^\dagger _{j,1}\big )\ge & {} x^*_{j,1}\log c^\dagger _{j,2}+(1-x^*_{j,1})\log \big (1- c^\dagger _{j,2}\big ) \end{aligned}$$

Adding these two inequalities up gives

$$\begin{aligned} (x^*_{j,2}-x^*_{j,1})\Big (\log c^\dagger _{j,2}-\log \big (1-c^\dagger _{j,2}\big )-\log c^\dagger _{j,1}+\log \big (1-c^\dagger _{j,1}\big )\Big )>0. \end{aligned}$$

Therefore

$$\begin{aligned} \log c^\dagger _{j,2}-\log \big (1-c^\dagger _{j,2}\big )>\log c^\dagger _{j,1}-\log \big (1-c^\dagger _{j,1}\big ). \end{aligned}$$
(20)

However, since \(c^\dagger _{j,2}<c^\dagger _{j,1}\) by assumption and the function \(\log x-\log (1-x)\) is strictly increasing for \(x\in (0,1)\), (20) is impossible. This finishes the proof. \(\square \)
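
As a purely numerical sanity check of Proposition 1 (the values are illustrative, and the penalty below is a simple nonnegative, nondecreasing placeholder with \(p(0)=0\), not the SCAD of (9)), a brute-force grid search for \(M=2\) confirms that the penalized maximizer preserves the order of \(x^*_{j,m}\):

```python
import numpy as np

# Expected counts for one item and two classes: x* = (0.25, 0.75)
a, b = np.array([3.0, 12.0]), np.array([9.0, 4.0])
lam, scad_a, N = 0.1, 3.7, 20
plateau = (scad_a + 1) ** 2 * lam ** 2 / 2
pen = lambda d: N * min(lam * d, plateau)      # placeholder fusion penalty

grid = np.linspace(0.01, 0.99, 99)
best, best_val = None, -np.inf
for c1 in grid:
    for c2 in grid:
        val = (a[0] * np.log(c1) + b[0] * np.log(1 - c1)
               + a[1] * np.log(c2) + b[1] * np.log(1 - c2)
               - pen(abs(c2 - c1)))
        if val > best_val:
            best, best_val = (c1, c2), val

assert best[0] <= best[1]   # the maximizer keeps the same order as x*
```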

Appendix 3: Proof of Theorem 1

Proof

Throughout the proof, we write \(a_N=o(b_N)\) for two sequences of vectors \(a_N\) and \(b_N\) if \(\Vert a_N\Vert /\Vert b_N\Vert \) tends to zero and \(a_N=O(b_N)\) if \(\Vert a_N\Vert /\Vert b_N\Vert \) is bounded as N varies. Moreover, for two sequences of random vectors \(a_N\) and \(b_N\), we write \(a_N=o_P(b_N)\) if \(\Vert a_N\Vert /\Vert b_N\Vert \) converges to zero in probability and \(a_N=O_P(b_N)\) if \(\Vert a_N\Vert /\Vert b_{N}\Vert \) is bounded in probability. To simplify the notation, we denote the true model parameters as \((\mathbf {c}, \varvec{\pi })\) and write \(\varvec{\theta }=(\mathbf {c},\varvec{\pi }_{-1})\), \(\hat{\varvec{\theta }}=(\hat{\mathbf {c}}^{\lambda _N},\hat{\varvec{\pi }}_{-1}^{\lambda _N})\) and \(\varvec{\theta }'=(\mathbf {c}',\varvec{\pi }_{-1}')\). Note that the event \(\Vert \hat{\varvec{\theta }}-\varvec{\theta }\Vert \ge \frac{C}{\sqrt{N}}\) implies

$$\begin{aligned} \sup _{\Vert \varvec{\theta }'-\varvec{\theta }\Vert \ge \frac{C}{\sqrt{N}},\varvec{\theta }'\in \Theta } \{l(\varvec{\theta }')-N\kappa _{\lambda }(\mathbf {c}')\}\ge l(\varvec{\theta })-N\kappa _{\lambda }(\mathbf {c}). \end{aligned}$$

Therefore, it is sufficient to show that for each \(\varepsilon >0\), there exists a sufficiently large constant C such that

$$\begin{aligned} \limsup _{N\rightarrow \infty }P\left( \sup _{\Vert \varvec{\theta }'-\varvec{\theta }\Vert \ge \frac{C}{\sqrt{N}},\varvec{\theta }'\in \Theta } \{l(\varvec{\theta }')-N\kappa _{\lambda }(\mathbf {c}')\}\ge l(\varvec{\theta })-N\kappa _{\lambda }(\mathbf {c})\right) \le \varepsilon . \end{aligned}$$

We split the probability above into two parts,

$$\begin{aligned} P\left( \sup _{\Vert \varvec{\theta }'-\varvec{\theta }\Vert \ge \frac{C}{\sqrt{N}},\varvec{\theta }'\in \Theta } \{l(\varvec{\theta }')-N\kappa _{\lambda }(\mathbf {c}')\}\ge l(\varvec{\theta })-N\kappa _{\lambda }(\mathbf {c}) \right) \le I_1 + I_2, \end{aligned}$$

where

$$\begin{aligned} I_1= P\left( \sup _{\Vert \varvec{\theta }'-\varvec{\theta }\Vert \ge \varepsilon _1,\varvec{\theta }'\in \Theta } \{l(\varvec{\theta }')-N\kappa _{\lambda _N}(\mathbf {c}')\}\ge l(\varvec{\theta })-N\kappa _{\lambda _N}(\mathbf {c}) \right) \end{aligned}$$

and

$$\begin{aligned} I_2= P\left( \sup _{\frac{C}{\sqrt{N}}\le \Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \varepsilon _1,\varvec{\theta }'\in \Theta } \{l(\varvec{\theta }')-N\kappa _{\lambda _N}(\mathbf {c}')\}\ge l(\varvec{\theta })-N\kappa _{\lambda _N}(\mathbf {c}) \right) . \end{aligned}$$

Here, \(\varepsilon _1\) is a positive constant independent of N, whose value will be chosen later. We present upper bounds for \(I_1\) and \(I_2\) separately. The next lemma, whose proof is given in Appendix 5, provides an upper bound for \(I_1\).

Lemma 1

For any fixed \(\varepsilon _1>0\), there exists a positive constant \(\varepsilon _2\) (depending on \(\varepsilon _1\)) such that for sufficiently large N, we have \( I_1\le e^{-\varepsilon _2 N}. \)

We proceed to the \(I_2\) term. We first analyze

$$\begin{aligned} \sup _{\frac{C}{\sqrt{N}}\le \Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \varepsilon _1,\varvec{\theta }'\in \Theta } \{l(\varvec{\theta }')-l(\varvec{\theta })-N\kappa _{\lambda _N}(\mathbf {c}')+N\kappa _{\lambda _N}(\mathbf {c})\}. \end{aligned}$$

It is straightforward to check that for \(\varvec{\theta }'\in \Theta \), there exists a sufficiently large positive constant \(\eta \) such that

$$\begin{aligned} \begin{aligned} \Vert \nabla l(\varvec{\theta }')\Vert \le \eta N,~ \Vert \nabla ^2 l(\varvec{\theta }') \Vert \le \eta N \text{ and } \Vert \nabla ^3 l(\varvec{\theta }') \Vert \le \eta N, \end{aligned} \end{aligned}$$
(21)

where \(\nabla ^2l\) and \(\nabla ^3 l\) denote vectors consisting of all second and third partial derivatives of l, respectively. According to (21), we compute the Taylor expansion of \(l(\varvec{\theta }')\) around \(\varvec{\theta }\) for \(\varvec{\theta }'\in \Theta \)

$$\begin{aligned} \begin{aligned} l(\varvec{\theta }')-l(\varvec{\theta })=&(\varvec{\theta }'-\varvec{\theta })^{\top }\nabla l(\varvec{\theta })- \frac{1}{2}N (\varvec{\theta }'-\varvec{\theta })^{\top }I(\varvec{\theta })(\varvec{\theta }'-\varvec{\theta })\\&+\,O_P(\Vert \varvec{\theta }'-\varvec{\theta }\Vert ^2\sqrt{N}) +O(\Vert \varvec{\theta }'-\varvec{\theta }\Vert ^3 N). \end{aligned} \end{aligned}$$
(22)

In (22), the term \(O_P(\Vert \varvec{\theta }'-\varvec{\theta }\Vert ^2\sqrt{N})\) corresponds to the remainder term for the second derivatives at \(\varvec{\theta }\) and the term \(O(\Vert \varvec{\theta }'-\varvec{\theta }\Vert ^3 N)\) corresponds to the terms involving third derivatives. Note that for \(\Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \varepsilon _1\), there exists a positive constant \(C_2\), independent of \(\varepsilon _1\), such that \(O(\Vert \varvec{\theta }'-\varvec{\theta }\Vert ^3N)\le C_2 \varepsilon _1 \Vert \varvec{\theta }'-\varvec{\theta }\Vert ^2N\). Thus, the \(O(\Vert \varvec{\theta }'-\varvec{\theta }\Vert ^3 N)\) term is dominated by the second term, that is,

$$\begin{aligned} -\frac{1}{2}N (\varvec{\theta }'-\varvec{\theta })^{\top }I(\varvec{\theta })(\varvec{\theta }'-\varvec{\theta })+O(\Vert \varvec{\theta }'-\varvec{\theta }\Vert ^3 N)\le -\frac{1}{2}\Vert \varvec{\theta }'-\varvec{\theta }\Vert ^2 N \Big (\inf _{\Vert v\Vert =1}v^{\top }I(\varvec{\theta })v -\varepsilon _1C_2\Big ).\qquad \end{aligned}$$
(23)

Also note that \(O_P(\Vert \varvec{\theta }'-\varvec{\theta }\Vert ^2\sqrt{N})=O_P(\Vert \varvec{\theta }'-\varvec{\theta }\Vert \sqrt{N})\) because \(\Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \varepsilon _1\) is bounded, and \(\nabla l(\varvec{\theta })=O_P(\sqrt{N})\). Thus,

$$\begin{aligned} (\varvec{\theta }'-\varvec{\theta })^{\top }\nabla l(\varvec{\theta })+O_P(\Vert \varvec{\theta }'-\varvec{\theta }\Vert ^2\sqrt{N})=O_P(\Vert \varvec{\theta }'-\varvec{\theta }\Vert \sqrt{N}). \end{aligned}$$
(24)

Combining (22), (23) and (24) gives

$$\begin{aligned} \begin{aligned}&\sup _{\frac{C}{\sqrt{N}}\le \Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \varepsilon _1} l(\varvec{\theta }')-l(\varvec{\theta })\\&\quad \le \sup _{\frac{C}{\sqrt{N}}\le \Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \varepsilon _1} \Vert \varvec{\theta }'-\varvec{\theta }\Vert O_P(\sqrt{N})-\frac{\Vert \varvec{\theta }'-\varvec{\theta }\Vert ^2N}{2}\Big (\inf _{\Vert v\Vert =1}v^{\top }I(\varvec{\theta })v -\varepsilon _1C_2\Big ), \end{aligned} \end{aligned}$$

which is further bounded above by

$$\begin{aligned} \sup _{\frac{C}{\sqrt{N}}\le \Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \varepsilon _1} l(\varvec{\theta }')-l(\varvec{\theta }) \le \frac{C}{\sqrt{N}}O_P(\sqrt{N}) -\frac{C^2}{2}\Big (\inf _{\Vert v\Vert =1}v^{\top }I(\varvec{\theta })v -\varepsilon _1C_2\Big ). \end{aligned}$$

Therefore, by choosing \(\varepsilon _1\) sufficiently small, we have

$$\begin{aligned} \sup _{\frac{C}{\sqrt{N}}\le \Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \varepsilon _1} l(\varvec{\theta }')-l(\varvec{\theta })\le -\frac{C^2}{4}\inf _{\Vert v\Vert =1}v^{\top }I(\varvec{\theta })v + O_P(1)C. \end{aligned}$$
(25)

We proceed to the penalty term. For simplicity of discussion, we only state the proof for the case where there is no \(j\in \{1,\ldots ,J\}\) such that \(c_{j,1}=c_{j,2}=\cdots =c_{j,M}\). That is, all items have discrimination power. When there are items that have the same item response function among all the latent classes, the proof is similar, and is thus omitted.

Define a function \(\mathbf {gap}(\varvec{\beta })=\min \{|\beta _i-\beta _j|:\beta _i\ne \beta _j, i=1,\ldots ,M, j=1,\ldots ,M \}\), where \(\varvec{\beta }=(\beta _1,\ldots ,\beta _M)\in R^M\) and there exist i and j such that \(\beta _i\ne \beta _j\). Note that the difference of consecutive order statistics \(c_{j,(m+1)}-c_{j,(m)}\) is either zero or greater than \(\frac{\mathbf {gap}(\mathbf {c}_j)}{4}\). Recall that by definition (9), \(p_{\lambda }^{SCAD}(x)=\frac{(a+1)^2\lambda ^2}{2}\) for all \(|x|\ge a\lambda \). Thus, the penalty term \(p_{\lambda _N}^{SCAD}(c_{j,(m+1)}-c_{j,(m)})\) is either 0 (when \(c_{j,(m+1)}-c_{j,(m)}=0\)) or \(\frac{(a+1)^2\lambda _N^2}{2}\) (when \(c_{j,(m+1)}-c_{j,(m)}>0\)), provided N is sufficiently large that \(\lambda _N<\frac{\min _{1\le j\le J}\mathbf {gap}(\mathbf {c}_j)}{4a}\). Therefore,

$$\begin{aligned} \begin{aligned} \kappa _{\lambda _N}(\mathbf {c})&=\sum _{j=1}^J\sum _{m=1}^{M-1} p_{\lambda _N}(c_{j,(m+1)}-c_{j,(m)})\\&=\frac{(a+1)^2\lambda _N^2}{2}\sum _{j=1}^J \mathrm {Card}(\{ m: c_{j,(m+1)}-c_{j,(m)}>0 \}), \end{aligned} \end{aligned}$$
(26)

where \(\mathrm {Card}(\cdot )\) denotes the number of elements in a set. On the other hand, we have the following lemma on \(\kappa _{\lambda _N}(\mathbf {c}')\), whose proof is given in Appendix 5.

Lemma 2

If \(\Vert \mathbf {c}'-\mathbf {c}\Vert <\frac{1}{4}\min _{1\le j\le J}\mathbf {gap}(\mathbf {c}_j)\) and \(\lambda _N\le \frac{1}{4a}\min _{1\le j\le J}\mathbf {gap}(\mathbf {c}_j)\), then

$$\begin{aligned} \kappa _{\lambda _N}(\mathbf {c}')\ge \frac{(a+1)^2\lambda _N^2}{2}\sum _{j=1}^J \mathrm {Card}(\{ m: c_{j,(m+1)}-c_{j,(m)}>0 \}). \end{aligned}$$

The above lemma and (26) show that \( \kappa _{\lambda _N}(\mathbf {c}')-\kappa _{\lambda _N}(\mathbf {c})\ge 0 \) for \(\lambda _N \le \frac{\min _{1\le j\le J}\mathbf {gap}(\mathbf {c}_j)}{4a}.\) Combining this with (25), we have that, for sufficiently large N,

$$\begin{aligned} \sup _{\frac{C}{\sqrt{N}}\le \Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \varepsilon _1}\{ l(\varvec{\theta }')-l(\varvec{\theta })-N(\kappa _{\lambda _N}(\mathbf {c}')-\kappa _{\lambda _N}(\mathbf {c}))\} \le -\frac{C^2}{4}\inf _{\Vert v\Vert =1}v^{\top }I(\varvec{\theta })v + O_P(1)C. \end{aligned}$$

Note that \(\inf _{\Vert v\Vert =1}v^{\top }I(\varvec{\theta })v \) is equal to the smallest eigenvalue of \(I(\varvec{\theta })\), which is positive by Assumption A3. Therefore, we have

$$\begin{aligned} I_2=P\left( \sup _{\frac{C}{\sqrt{N}}\le \Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \varepsilon _1,\varvec{\theta }'\in \Theta } \{l(\varvec{\theta }')-N\kappa _{\lambda _N}(\mathbf {c}')\}\ge l(\varvec{\theta })-N\kappa _{\lambda _N}(\mathbf {c}) \right) \le \frac{\varepsilon }{2} \end{aligned}$$

for C sufficiently large. Combining our results for \(I_1\) and \(I_2\), we conclude the proof. \(\square \)

Appendix 4: Proof of Theorem 2

Proof

We first present a useful lemma, whose proof is given in Appendix 5.

Lemma 3

There exist constants C and \(C_1\) such that

$$\begin{aligned} P\left( \sup _{\Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \frac{C}{\sqrt{N}}}\Vert \nabla l(\varvec{\theta }')\Vert \le C_1\sqrt{N},\Vert \hat{\varvec{\theta }}-\varvec{\theta }\Vert \le \frac{C}{\sqrt{N}}\right) >1-\varepsilon \end{aligned}$$

for sufficiently large N.

Define the event \(\Omega _1=\Big \{\sup _{\Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \frac{C}{\sqrt{N}}}\Vert \nabla l(\varvec{\theta }')\Vert \le C_1\sqrt{N},\Vert \hat{\varvec{\theta }}-\varvec{\theta }\Vert \le \frac{C}{\sqrt{N}}\Big \}\). It is sufficient to show that, on the event \(\Omega _1\), \(\hat{\mathbf {c}}^{\lambda _N}\) and \(\mathbf {c}\) have the same partially merged pattern for N large enough. We prove this by contradiction. Assume, on the contrary, that the partially merged patterns of \(\hat{\mathbf {c}}^{\lambda _N}\) and \(\mathbf {c}\) are different; we will then construct a \(\tilde{\varvec{\theta }}\in \Theta \) such that

$$\begin{aligned} l(\tilde{\varvec{\theta }})-N\kappa _{\lambda _N}(\tilde{\mathbf {c}})>l(\hat{\varvec{\theta }})-N\kappa _{\lambda _N}(\hat{\mathbf {c}}^{\lambda _N}), \end{aligned}$$
(27)

which contradicts the definition of \(\hat{\varvec{\theta }}\). Without loss of generality, assume that \(\hat{\mathbf {c}}_1^{\lambda _N}\) and \(\mathbf {c}_1\) have different partially merged patterns. That is, there exist \(m_1,m_2\in \{1,\ldots ,M\}\) such that \(c_{1,m_1}\le c_{1,m_2}\) but \(\hat{c}^{\lambda _N}_{1,m_1}>\hat{c}^{\lambda _N}_{1,m_2}\). There are two cases: (1) \(c_{1,m_1}<c_{1,m_2}\) and (2) \(c_{1,m_1}=c_{1,m_2}\). Because on the event \(\Omega _1\), \(|\hat{c}^{\lambda _N}_{1,m_i}-c_{1,m_i}|<\frac{C}{\sqrt{N}}\) \((i=1,2)\), the first case is not possible when N is sufficiently large. Thus, we only need to consider the second case where \(c_{1,m_1}=c_{1,m_2}\) and \(\hat{c}^{\lambda _N}_{1,m_1}>\hat{c}^{\lambda _N}_{1,m_2}\). Define two sets of indices as follows:

$$\begin{aligned} A=\Big \{m_2\in \{1,\ldots ,M\}: \exists m_1\in \{1,\ldots ,M\} \text{ such } \text{ that } c_{1,m_1}=c_{1,m_2} \text{ and } \hat{c}_{1,m_1}^{\lambda _N}>\hat{c}^{\lambda _N}_{1,m_2} \Big \}, \end{aligned}$$

and

$$\begin{aligned} B=\Big \{l\in \{1,\ldots ,M\}: \hat{c}^{\lambda _N}_{1,l}=\min _{m\in A} \hat{c}_{1,m}^{\lambda _N} \Big \}. \end{aligned}$$

The set B is a subset of A, collecting the indices at which \(\hat{c}^{\lambda _N}_{1,m}\) attains its minimum over A. For instance, if \(\mathbf {c}_1=(0.3,0.3,0.7)\) and \(\hat{\mathbf {c}}^{\lambda _N}_1=(0.32,0.29,0.71)\), then \(A=B=\{2\}\). Due to the assumption above, both A and B are non-empty sets. Now we construct \(\tilde{\mathbf {c}}\) as follows:

$$\begin{aligned} \tilde{c}_{1,m}=\left\{ \begin{array}{ll} \hat{c}^{\lambda _N}_{1,m}+\Delta &{}\quad \text{ if } m\in B\\ \hat{c}^{\lambda _N}_{1,m}&{}\quad \text{ if } m\notin B \end{array},\right. \end{aligned}$$

where \(\Delta \) is a sufficiently small positive number that will be chosen later. For \(j=2,\ldots ,J\) and \(m=1,\ldots ,M\), we keep \(\tilde{c}_{j,m}=\hat{c}_{j,m}^{\lambda _N}\). We also set \(\tilde{\varvec{\pi }}_{-1}=\hat{\varvec{\pi }}_{-1}^{\lambda _N}\). That is, \(\tilde{\varvec{\theta }}\) and \(\hat{\varvec{\theta }}\) are the same except for \(\tilde{c}_{1,m}\) where \(m\in B\). We proceed to compare \(l(\tilde{\varvec{\theta }})-N\kappa _{\lambda _N}(\tilde{\mathbf {c}})\) and \(l(\hat{\varvec{\theta }})-N\kappa _{\lambda _N}(\hat{\mathbf {c}}^{\lambda _N})\). Because \(\tilde{\varvec{\theta }}\) and \(\tilde{\mathbf {c}}\) depend on \(\Delta \), we write \(\tilde{\varvec{\theta }}(\Delta )\) and \(\tilde{\mathbf {c}}(\Delta )\) to indicate this dependence.

Lemma 4

On the event \(\Omega _1\), for N sufficiently large, \(\kappa _{\lambda _N}(\tilde{\mathbf {c}}(\Delta ))\) is differentiable at \(\Delta =0\). Furthermore, \( \frac{d\kappa _{\lambda _N}(\tilde{\mathbf {c}}(\Delta ))}{d\Delta }\Big |_{\Delta =0} = -\lambda _N.\)

The lemma above allows us to take the derivative of \(q(\Delta )=l(\tilde{\varvec{\theta }}(\Delta ))-N\kappa _{\lambda _N}(\tilde{\mathbf {c}}(\Delta ))\) with respect to \(\Delta \) on the event \(\Omega _1\),

$$\begin{aligned} \dot{q}(0)=\sum _{m\in B}\frac{\partial l(\hat{\varvec{\theta }})}{\partial c_{1,m}}+ N\lambda _N. \end{aligned}$$

Recall that on event \(\Omega _1\), \(|\sum _{m\in B}\frac{\partial l(\hat{\varvec{\theta }})}{\partial c_{1,m}}|\le C_1\sqrt{N}\mathrm {Card}(B)\). This, together with Lemma 4, gives

$$\begin{aligned} \dot{q}(0)\ge \sqrt{N}(-C_1\mathrm {Card}(B)+\sqrt{N}\lambda _N). \end{aligned}$$

Note that \(\sqrt{N}\lambda _N\rightarrow \infty \) as \(N\rightarrow \infty \). Thus, \(\dot{q}(0)>0\) for sufficiently large N. This implies that \(q(\Delta )>q(0)=l(\hat{\varvec{\theta }})-N\kappa _{\lambda _N}(\hat{\mathbf {c}}^{\lambda _N})\) for sufficiently small positive \(\Delta \). It further implies that (27) holds for such \(\tilde{\varvec{\theta }}(\Delta )\), contradicting the definition of \(\hat{\varvec{\theta }}\). \(\square \)

Appendix 5: Proof of Supporting Lemmas

Proof of Lemma 1

Note that the event

$$\begin{aligned} \sup _{\Vert \varvec{\theta }'-\varvec{\theta }\Vert \ge \varepsilon _1,\varvec{\theta }'\in \Theta } \{l(\varvec{\theta }')-N\kappa _{\lambda _N}(\mathbf {c}')\}\ge l(\varvec{\theta })-N\kappa _{\lambda _N}(\mathbf {c}) \end{aligned}$$

implies

$$\begin{aligned} \sup _{\Vert \varvec{\theta }'-\varvec{\theta }\Vert \ge \varepsilon _1,\varvec{\theta }'\in \Theta } \{l(\varvec{\theta }')-l(\varvec{\theta })\}\ge N\inf _{\varvec{\theta }'\in \Theta }\{ \kappa _{\lambda _N}(\mathbf {c}')-\kappa _{\lambda _N}(\mathbf {c}) \}. \end{aligned}$$

Thus, we have an upper bound for the probability

$$\begin{aligned} \begin{aligned}&P\left( \sup _{\Vert \varvec{\theta }'-\varvec{\theta }\Vert \ge \varepsilon _1,\varvec{\theta }'\in \Theta } \{l(\varvec{\theta }')-N\kappa _{\lambda _N}(\mathbf {c}')\}\ge l(\varvec{\theta })-N\kappa _{\lambda _N}(\mathbf {c})\right) \\&\quad \le P\left( \sup _{\Vert \varvec{\theta }'-\varvec{\theta }\Vert \ge \varepsilon _1,\varvec{\theta }'\in \Theta } \{l(\varvec{\theta }')-l(\varvec{\theta })\}\ge N\inf _{\varvec{\theta }'\in \Theta }\{ \kappa _{\lambda _N}(\mathbf {c}')-\kappa _{\lambda _N}(\mathbf {c}) \}\right) . \end{aligned} \end{aligned}$$
(28)

According to the definition of \(\kappa _{\lambda _N}\), we have \(0\le \kappa _{\lambda _N}(\mathbf {c}')\le J(M-1)\times \frac{(a+1)^2\lambda _N^2}{2}\). Therefore, (28) is further bounded above by

$$\begin{aligned} \begin{aligned}&P\left( \sup _{\Vert \varvec{\theta }'-\varvec{\theta }\Vert \ge \varepsilon _1,\varvec{\theta }'\in \Theta } \{l(\varvec{\theta }')-N\kappa _{\lambda _N}(\mathbf {c}')\}\ge l(\varvec{\theta })-N\kappa _{\lambda _N}(\mathbf {c})\right) \\&\quad \le P\left( \sup _{\Vert \varvec{\theta }'-\varvec{\theta }\Vert \ge \varepsilon _1,\varvec{\theta }'\in \Theta } \{l(\varvec{\theta }')-l(\varvec{\theta })\}\ge -C_3 N\lambda _N^2 \right) \end{aligned} \end{aligned}$$

where \(C_3=J(M-1)\times \frac{(a+1)^2}{2}\) is a constant. Note that \(\lambda _N\rightarrow 0\) as \(N\rightarrow \infty \), so the right-hand side of the above display is the type I error probability of the generalized likelihood ratio test with a \(e^{o(N)}\) cut-off value for testing

$$\begin{aligned} H_0: \varvec{\theta }'=\varvec{\theta } \text{ against } H_1: \Vert \varvec{\theta }'-\varvec{\theta }\Vert \ge \varepsilon _1,\varvec{\theta }'\in \Theta , \end{aligned}$$
(29)

whose exponential decay rate has been established in Lemma 3 of Li, Liu, and Ying (2016): there exists a rate \(\rho >0\) such that \(P\Big ( \sup _{\Vert \varvec{\theta }'-\varvec{\theta }\Vert \ge \varepsilon _1,\varvec{\theta }'\in \Theta } \{l(\varvec{\theta }')-l(\varvec{\theta })\}\ge -C_3 N\lambda _N^2 \Big )=e^{-(\rho +o(1))N}.\) Choosing \(\varepsilon _2\) to be positive and smaller than \(\rho \), we conclude the proof. \(\square \)

Proof of Lemma 2

Because \(\kappa _{\lambda _N}(\mathbf {c}')=\sum _{j=1}^J p_{\lambda _N}(\mathbf {c}'_j)\), it is sufficient to show that for each \(j\in \{1,\ldots ,J\}\)

$$\begin{aligned} p_{\lambda _N}(\mathbf {c}'_j)\ge \frac{(a+1)^2\lambda _N^2}{2}\mathrm {Card}(\{ m: c_{j,(m+1)}-c_{j,(m)}>0 \}). \end{aligned}$$

Similar to the discussion preceding (26), we only need to prove that for each \(j\in \{1,\ldots ,J\}\),

$$\begin{aligned} \mathrm {Card}(\{ m:c'_{j,(m+1)}-c'_{j,(m)}\ge a\lambda _N \})\ge \mathrm {Card}(\{ m: c_{j,(m+1)}-c_{j,(m)}>0 \}). \end{aligned}$$
(30)

We first prove that for each \(m\in \{1,\ldots ,M-1\}\), if \(c_{j,(m+1)}-c_{j,(m)}>0\), then there exists \(m'\in \{1,\ldots ,M-1\}\) such that

$$\begin{aligned} |c'_{j,m'}-c_{j,(m)}|\le \frac{1}{4}\mathbf {gap}(\mathbf {c}_j) \text{ and } \min \{c'_{j,l}-c'_{j,m'}: {c'_{j,l}>c'_{j,m'}} \}\ge a\lambda _N. \end{aligned}$$
(31)

To proceed, we define a set \(D=\{ l:c_{j,l}=c_{j,(m)} \}\). We choose \(m' \in D\) such that \( c'_{j,m'}=\max _{k\in D} c'_{j,k}.\) Recall that we assume \(\Vert \mathbf {c}'-\mathbf {c}\Vert \le \frac{1}{4}\min _{1\le j\le J}\mathbf {gap}(\mathbf {c}_j)\). Thus, we have \(|c'_{j,m'}-c_{j,(m)}| = |c'_{j,m'}-c_{j,m'}|\le \frac{1}{4}\mathbf {gap}(\mathbf {c}_j)\). Moreover, for each l such that \(c'_{j,l}>c'_{j,m'}\), \(l \notin D\), due to the choice of \(m'\). We then show

$$\begin{aligned} c'_{j,l} > c_{j,(m)} + \frac{1}{2} \mathbf {gap}(\mathbf {c}_j) \end{aligned}$$

by contradiction. If \(c'_{j,l} \le c_{j,(m)} + \frac{1}{2} \mathbf {gap}(\mathbf {c}_j) \), then

$$\begin{aligned} c_{j,l} < c_{j,(m)} + \mathbf {gap}(\mathbf {c}_j). \end{aligned}$$
(32)

Since \(c_{j,(m+1)}\ge c_{j,(m)}+\mathbf {gap}(\mathbf {c}_j)\), combining this with (32) implies that

$$\begin{aligned} c_{j,l} = c_{j,(m)} \text{ or } c_{j,l} < c_{j,(m)}. \end{aligned}$$

On the one hand, \(c_{j,l} = c_{j,(m)}\) contradicts \(l \notin D\). On the other hand, if \(c_{j,l} < c_{j,(m)}\), then \(c_{j,l}\le c_{j,(m-1)}\) and

$$\begin{aligned} c'_{j,l} \le c_{j,l} + \frac{1}{4} \mathbf {gap}(\mathbf {c}_j) \le c_{j,(m-1)} + \frac{1}{4} \mathbf {gap}(\mathbf {c}_j) \le c'_{j,m'}, \end{aligned}$$

contradicting \(c'_{j,l}>c'_{j,m'}\). Therefore,

$$\begin{aligned} c'_{j,l} > c_{j,(m)} + \frac{1}{2} \mathbf {gap}(\mathbf {c}_j) \ge c'_{j,m'} + \frac{1}{4} \mathbf {gap}(\mathbf {c}_j)\ge c'_{j,m'} +a\lambda _N, \end{aligned}$$

when \(\lambda _{N}\le \frac{1}{4a}\min _{j\in \{1,\ldots ,J\}}\mathbf {gap}(\mathbf {c}_j)\). Therefore, (31) holds for such \(\lambda _{N}\). Notice that for different values of m with \(c_{j,(m+1)}-c_{j,(m)}>0\), the corresponding indices \(m'\) for which (31) holds are distinct. Thus, (30) is proved. \(\square \)

Proof of Lemma 3

According to Theorem 1, for each \(\varepsilon \), there exists a constant C such that for sufficiently large N,

$$\begin{aligned} P\left( \Vert \hat{\varvec{\theta }}-\varvec{\theta }\Vert >\frac{C}{\sqrt{N}}\right) <\frac{\varepsilon }{2}. \end{aligned}$$
(33)

Now, for \(\Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \frac{C}{\sqrt{N}}\), we expand \(\nabla l(\varvec{\theta }')\) around \(\varvec{\theta }\),

$$\begin{aligned} \Vert \nabla l(\varvec{\theta }')-\nabla l(\varvec{\theta }) \Vert \le \sup _{\Vert \tilde{\varvec{\theta }}-\varvec{\theta }\Vert \le \frac{C}{\sqrt{N}}} \Vert \nabla ^2 l(\tilde{\varvec{\theta }}) \Vert \Vert \varvec{\theta }'-\varvec{\theta }\Vert . \end{aligned}$$

By (21) the right-hand side of the above display is further bounded above by \(\eta N\times \frac{C}{\sqrt{N}}= C\eta \sqrt{N}\). Thus, for \(\Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \frac{C}{\sqrt{N}}\),

$$\begin{aligned} \Vert \nabla l(\varvec{\theta }')\Vert \le \Vert \nabla l(\varvec{\theta }) \Vert + C\eta \sqrt{N}. \end{aligned}$$

Taking the supremum with respect to \(\varvec{\theta }'\) in the above display, we have

$$\begin{aligned} \sup _{\Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \frac{C}{\sqrt{N}}}\Vert \nabla l(\varvec{\theta }')\Vert \le \Vert \nabla l(\varvec{\theta }) \Vert + C\eta \sqrt{N}. \end{aligned}$$

Note that \(\Vert \nabla l(\varvec{\theta }) \Vert =O_P(\sqrt{N})\). This and the above display yield

$$\begin{aligned} \sup _{\Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \frac{C}{\sqrt{N}}}\Vert \nabla l(\varvec{\theta }')\Vert = O_P(\sqrt{N}). \end{aligned}$$

Consequently, we can choose \(C_1\) sufficiently large such that

$$\begin{aligned} P\left( \sup _{\Vert \varvec{\theta }'-\varvec{\theta }\Vert \le \frac{C}{\sqrt{N}}}\Vert \nabla l(\varvec{\theta }')\Vert >C_1\sqrt{N}\right) <\frac{\varepsilon }{2}. \end{aligned}$$

We combine this with (33), concluding the proof. \(\square \)

Proof of Lemma 4

Let \(K=\mathrm {Card}(\{\hat{c}^{\lambda _N}_{1,1},\ldots ,\hat{c}^{\lambda _N}_{1,M} \})\) be the number of distinct values in \(\hat{\mathbf {c}}^{\lambda _N}_1\). Define the vector of ordered distinct values in \(\hat{\mathbf {c}}^{\lambda _N}_1\) as \(\hat{\gamma }=(\hat{\gamma }_1,\ldots ,\hat{\gamma }_K)^T\) such that \(\hat{\gamma }_1<\hat{\gamma }_2<\cdots <\hat{\gamma }_K\) and \(\{ \hat{\gamma }_1,\ldots ,\hat{\gamma }_K \}=\{ \hat{c}_{1,1}^{\lambda _N},\ldots ,\hat{c}_{1,M}^{\lambda _N} \}\). We define \(\tilde{\gamma }\) in the same manner. Let \(k^*\) satisfy \(\hat{\gamma }_{k^*}=\min _{l\in A}\hat{c}^{\lambda _N}_{1,l}\). We choose \(|\Delta |<\min \{\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*} ,\hat{\gamma }_{k^*}-\hat{\gamma }_{k^*-1}\}\). Then \(\tilde{\mathbf {c}}_1\) and \(\hat{\mathbf {c}}^{\lambda _N}_1\) have the same partially merged pattern and for \(k=1,\ldots ,K\)

$$\begin{aligned} \tilde{\gamma }_k= {\left\{ \begin{array}{ll} \hat{\gamma }_k &{} \text{ if } k\ne k^*,\\ \hat{\gamma }_k +\Delta &{} \text{ if } k=k^*. \end{array}\right. } \end{aligned}$$
(34)

The penalty term for \(\tilde{\mathbf {c}}_1\) is

$$\begin{aligned} p_{\lambda _N}(\tilde{\mathbf {c}}_1)=\sum _{k=1}^{K-1}p_{\lambda _N}^{SCAD}(\tilde{\gamma }_{k+1}-\tilde{\gamma }_k). \end{aligned}$$

By (34), the above display becomes

$$\begin{aligned} p_{\lambda _N}(\tilde{\mathbf {c}}_1)=p_{\lambda _N}^{SCAD}(\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*}-\Delta )+p_{\lambda _N}^{SCAD}(\hat{\gamma }_{k^*}+\Delta -\hat{\gamma }_{k^*-1}) + \sum _{k\notin \{k^*,k^*-1\} }p_{\lambda _N}^{SCAD}(\hat{\gamma }_{k+1}-\hat{\gamma }_k), \end{aligned}$$

where we set \(p_{\lambda _N}^{SCAD}(\hat{\gamma }_{k^*}+\Delta -\hat{\gamma }_{k^*-1})\) to be 0 if \(k^*=1\). We compare this with the penalty term of \(\hat{\mathbf {c}}_1^{\lambda _N}\)

$$\begin{aligned} p_{\lambda _N}(\tilde{\mathbf {c}}_1)-p_{\lambda _N}(\hat{\mathbf {c}}_1^{\lambda _N})=q_1(\Delta )+q_2(\Delta ), \end{aligned}$$
(35)

where we define \( q_1(\Delta )=p_{\lambda _N}^{SCAD}(\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*}-\Delta )-p_{\lambda _N}^{SCAD}(\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*}) \) and \( q_2(\Delta )=p_{\lambda _N}^{SCAD}(\hat{\gamma }_{k^*}+\Delta -\hat{\gamma }_{k^*-1})-p_{\lambda _N}^{SCAD}(\hat{\gamma }_{k^*}-\hat{\gamma }_{k^*-1}). \) We will show that \(\dot{q}_1(0)=-\lambda _N\) and \(\dot{q}_2(0)=0\). To proceed, we first analyze \(\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*}\). Let \(m^*_2\) satisfy \(\hat{c}_{1,m^*_2}^{\lambda _N}=\min _{l\in A}\hat{c}_{1,l}^{\lambda _N} = \hat{\gamma }_{k^*}\). According to the definition of the set A, there exists \(m_1\) such that \(\hat{c}_{1,m_1}^{\lambda _N}>\hat{c}_{1,m^*_2}^{\lambda _N}\) and \(c_{1,m_1}=c_{1,m^*_2}\). Note that \(\hat{c}_{1,m_1}^{\lambda _N}\le c_{1,m_1}+\frac{C}{\sqrt{N}}\) and \(\hat{c}_{1,m^*_2}^{\lambda _N}\ge c_{1,m^*_2}-\frac{C}{\sqrt{N}}\) on the event \(\Omega _1\), so \(\hat{c}_{1,m_1}^{\lambda _N}\le \hat{c}_{1,m^*_2}^{\lambda _N}+\frac{2C}{\sqrt{N}}\). Recall that \(\hat{\gamma }_{k^*}= \min _{l\in A}\hat{c}_{1,l}^{\lambda _N}=\hat{c}_{1,m^*_2}^{\lambda _N}\) and \(\hat{\gamma }_{k^*+1}\le \hat{c}^{\lambda _N}_{1,m_1}\). Thus, \(\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*}\le \frac{2C}{\sqrt{N}}\). Because \(\lambda _N\sqrt{N}\rightarrow \infty \) as N grows large, \(\frac{2C}{\sqrt{N}}<\frac{\lambda _N}{2}\) for sufficiently large N. Consequently, for \(|\Delta |<\frac{C}{\sqrt{N}}\),

$$\begin{aligned} |\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*}-\Delta |<\lambda _N. \end{aligned}$$
(36)

According to the definition of \(p_{\lambda _N}^{SCAD}\) in (9) and (36), we have

$$\begin{aligned} q_{1}(\Delta )=\lambda _N|\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*}-\Delta |-\lambda _N(\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*}) \text{ for } |\Delta |<\frac{C}{\sqrt{N}}. \end{aligned}$$

Because \(\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*}>0\), for \(|\Delta |<\min \{\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*}, \frac{C}{\sqrt{N}} \}\),

$$\begin{aligned} q_{1}(\Delta )=\lambda _N(\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*}-\Delta )-\lambda _N(\hat{\gamma }_{k^*+1}-\hat{\gamma }_{k^*})=-\lambda _N\Delta . \end{aligned}$$

Therefore,

$$\begin{aligned} \dot{q}_1(0)=-\lambda _N. \end{aligned}$$

Now we proceed to the analysis of \(q_2\). If \(k^*=1\), then \(q_2(\Delta )\) is set to 0, and so is \(\dot{q}_2(\Delta )\). We proceed to the case where \(k^*\ge 2\). Choose \(m^*_1\) such that \(\hat{c}^{\lambda _N}_{1,m^*_1}=\hat{\gamma }_{k^*-1}\). As \(\hat{c}^{\lambda _N}_{1,m^*_1}=\hat{\gamma }_{k^*-1}<\hat{\gamma }_{k^*}=\hat{c}^{\lambda _N}_{1,m^*_2}\), we know \(m_1^*\notin A\) and \(c_{1,m^*_1}\ne c_{1,m^*_2}\) because of the definition of A and B. Furthermore, according to the analysis below (27), it is not possible to have \(c_{1,m^*_1}>c_{1,m^*_2}\) on event \(\Omega _1\). Thus, we have \(c_{1,m^*_1}<c_{1,m^*_2}\). Now let N be sufficiently large such that \(\frac{2C}{\sqrt{N}}<\frac{c_{1,m^*_2}-c_{1,m^*_1}}{2}\), then \(\hat{c}^{\lambda _N}_{1,m^*_2}-\hat{c}^{\lambda _N}_{1,m^*_1}>\frac{c_{1,m^*_2}-c_{1,m^*_1}}{2}\) on the event \(\Omega _1\). Thus, for \(|\Delta |<\frac{c_{1,m^*_2}-c_{1,m^*_1}}{4}\) we have

$$\begin{aligned} \hat{\gamma }_{k^*}+\Delta -\hat{\gamma }_{k^*-1}>\frac{c_{1,m^*_2}-c_{1,m^*_1}}{4}. \end{aligned}$$

By the definition in (9), for N sufficiently large such that \(a\lambda _N<\frac{c_{1,m^*_2}-c_{1,m^*_1}}{4}\),

$$\begin{aligned} q_2(\Delta )=p_{\lambda _N}^{SCAD}(\hat{\gamma }_{k^*}+\Delta -\hat{\gamma }_{k^*-1})=0 ~~ \text{ for } |\Delta |<\frac{c_{1,m^*_2}-c_{1,m^*_1}}{4}. \end{aligned}$$

Thus, \(\dot{q_2}(0)=0\). Combining this with \(\dot{q_1}(0)=-\lambda _N\) and (35), \( \frac{d}{d\Delta }\{p_{\lambda _N}(\tilde{\mathbf {c}}_1)-p_{\lambda _N}(\hat{\mathbf {c}}_1^{\lambda _N})\}|_{\Delta =0}=-\lambda _N. \) We conclude the proof by noting that \(\kappa _{\lambda _N}(\tilde{\mathbf {c}})=\sum _{j=1}^Jp_{\lambda _N}(\tilde{\mathbf {c}}_j)\) and that \(\tilde{\mathbf {c}}_j=\hat{\mathbf {c}}_j^{\lambda _N}\) for \(j\in \{2,\ldots ,J\}\). \(\square \)

Cite this article

Chen, Y., Li, X., Liu, J. et al. Regularized Latent Class Analysis with Application in Cognitive Diagnosis. Psychometrika 82, 660–692 (2017). https://doi.org/10.1007/s11336-016-9545-6
