
Generalized Bayesian method for diagnostic classification models

Published online by Cambridge University Press:  03 January 2025

Kazuhiro Yamaguchi*
Affiliation:
University of Tsukuba, Tsukuba, Japan
Yanlong Liu
Affiliation:
University of Michigan, Ann Arbor, MI, USA
Gongjun Xu
Affiliation:
University of Michigan, Ann Arbor, MI, USA
Corresponding author: Kazuhiro Yamaguchi; Email: yamaguchi.kazuhir.ft@u.tsukuba.ac.jp

Abstract

This study extends the loss function-based parameter estimation method for diagnostic classification models proposed by Ma, de la Torre, et al. (2023, Psychometrika) to consider prior knowledge and uncertainty of sampling. To this end, we integrate the loss function-based estimation method with the generalized Bayesian method. We establish the consistency of attribute mastery patterns of the proposed generalized Bayesian method. The proposed generalized Bayesian method is compared in a simulation study and found to be superior to the previous nonparametric diagnostic classification method—a special case of the loss function-based method. Moreover, the proposed method is applied to real data and compared with previous parametric and nonparametric estimation methods. Finally, practical guidelines for the proposed method and future research directions are discussed.

Type
Theory and Methods
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of Psychometric Society

1 Introduction

Learning is an important aspect of human life. The current status of an individual’s knowledge or depth of understanding must be evaluated to ensure efficient learning. Test analysis models called diagnostic classification models (DCMs; Rupp et al., Reference Rupp, Templin and Henson2010; von Davier & Lee, Reference von Davier and Lee2019) have been widely employed to capture an individual’s learning status. Notably, DCMs provide useful statistical tools to reveal individuals’ current learning status based on their item responses to a test. Latent knowledge or cognitive elements are called attributes and are expressed as latent categorical variables in DCMs. Moreover, DCMs are known as restricted latent class models (e.g., Rupp & Templin, Reference Rupp and Templin2008; Xu, Reference Xu2017), wherein each possible set of attributes represents a latent class. In other words, an attribute mastery pattern indicates which attributes are mastered and which are not. Therefore, one of the DCMs’ final outputs is the estimate of individuals’ attribute mastery patterns or attribute mastery probabilities.

Various parameter estimation methods for the DCMs have been actively developed. Parametric and nonparametric estimation methods are commonly used in DCMs. Parametric estimation methods assume parametric item response functions and structural models. Therefore, parametric estimation methods employ a likelihood function under the assumed model and include (penalized or regularized) maximum likelihood estimation (e.g., Chen et al., Reference Chen, Liu, Xu and Ying2015; de la Torre, Reference de la Torre2009; Ma, Ouyang, & Xu, Reference Ma, Ouyang and Xu2023) and Bayesian estimation methods (e.g., Culpepper, Reference Culpepper2015; Yamaguchi & Okada, Reference Yamaguchi and Okada2020; Yamaguchi & Templin, Reference Yamaguchi and Templin2022b), incorporating prior distributions for model parameters. Numerous parametric estimation methods have been developed and their properties have been studied (e.g., von Davier & Lee, Reference von Davier and Lee2019).

On the other hand, nonparametric methods (e.g., Chiu et al., Reference Chiu, Sun and Bian2018; Chiu & Douglas, Reference Chiu and Douglas2013) do not use probabilistic item response models; instead, they use an ideal response to define a type of discrepancy function, which will be formally defined in a later section. Such discrepancy functions are defined based on the distance between each item’s ideal and actual responses. Intuitively, nonparametric methods directly estimate attribute mastery patterns, which minimize the discrepancy function. Therefore, nonparametric methods do not require a probabilistic item response function. Nonparametric methods exhibit satisfactory statistical properties, such as consistency under certain conditions (Chiu & Köhn, Reference Chiu and Köhn2019; Wang & Douglas, Reference Wang and Douglas2015).

Recently, a general parameter estimation method that can uniformly express parametric and nonparametric methods was developed by Ma, de la Torre, and Xu (Reference Ma, de la Torre and Xu2023). The unified estimation method developed by Ma, de la Torre, and Xu (Reference Ma, de la Torre and Xu2023) is a loss function-based estimation method for DCMs. If we select cross-entropy as the loss function, its minimization corresponds to maximizing the joint likelihood (Chiu et al., Reference Chiu, Köhn, Zheng and Henson2016). The distance or discrepancy function in nonparametric methods is also a well-known loss function. Additionally, by adding penalty terms to a cross-entropy loss function and minimizing it, we obtain maximum a posteriori (MAP) estimates, which are classical Bayesian estimates. These examples indicate that the loss function-based estimation method is flexible and can represent various estimation methods in a unified manner. Furthermore, a unified estimation algorithm for the loss function-based estimation method is available.

However, loss function-based methods exhibit certain limitations. First, these methods only provide point estimates, which may be problematic because we cannot evaluate how the point estimates vary due to sampling or estimation variation. Therefore, we cannot evaluate the uncertainty of attribute mastery using the loss function-based method. Furthermore, attribute mastery probabilities for each individual are not expressed in the loss function-based method. The same problem occurs in DCMs’ nonparametric estimation methods. However, attribute mastery probabilities convey a more nuanced picture than binary attribute mastery pattern results (mastery versus nonmastery). Another limitation of these methods is that prior information on the weight parameters that define generalized ideal responses in the generalized nonparametric method is generally not considered. However, DCM users may have prior knowledge of the test items’ conjunctive or disjunctive nature. If so, such domain-specific knowledge should be incorporated to improve parameter estimates.

Loss function-based methods are not the only methods with limitations that need to be addressed; several limitations of previous parametric and nonparametric estimation methods should likewise be noted. First, parametric estimation methods need to specify the data-generating distribution, which determines the likelihood function. The likelihood function provides the connection between the data and model parameters such as attribute mastery patterns. Moreover, likelihood functions make it possible to evaluate estimation uncertainty using asymptotic theory within the maximum likelihood framework or the posterior distribution within the Bayesian framework. However, the data-generating process cannot always be correctly specified. DCMs belong to a family of educational measurement models that involve various constraints, which makes correct model specification difficult.

Some of the limitations of the nonparametric methods are the same as those of the loss function-based methods. For instance, current nonparametric methods for DCMs cannot evaluate the uncertainty of attribute mastery estimates. The nonparametric methods for DCMs were developed for settings with small sample sizes (Chiu & Douglas, Reference Chiu and Douglas2013). Although the nonparametric methods can thus be applied to individual-level diagnosis, the parameter estimates should be evaluated together with their variability. Currently, the nonparametric methods simply select the attribute mastery patterns that minimize prespecified distance functions, so parameter uncertainty evaluation is not included in the framework. Moreover, the parameter estimates obtained with nonparametric methods can change with small differences in the loss function. One main purpose of DCMs is the diagnosis of individual knowledge. Thus, such variations in parameter estimates due to small differences in the distance functions may be a fundamental problem for application.

To overcome these limitations and extend the previous loss function-based estimation method for DCMs, we employ the generalized Bayesian (GB; Bissiri et al., Reference Bissiri, Holmes and Walker2016) method. The usual Bayesian parameter update requires specifying a likelihood function and updates the model parameters in the likelihood with the observed dataset. By contrast, the GB method expresses parameter updating with a dataset via loss functions. Therefore, Bayesian inference becomes applicable to nonparametric-based estimation methods as well as to likelihood-based methods. Moreover, the GB method has the following additional benefits. First, the GB method originally assumes the $\mathcal{M}$ -open setting (Bernardo & Smith, Reference Bernardo and Smith2009, Chap. 6), which implies that the GB method provides a valid inference even if the assumed model does not match the true data-generating mechanism. Various DCMs have been developed; however, selecting an appropriate item response function that expresses the true data-generating mechanism is not always possible. The GB method does not require an entire data-generating model but instead sets a loss function related to the parameter sets of interest. This means that we do not need to find a correct data-generating model, which is always unknown and often misspecified. We expect the GB method to overcome the practical difficulties of DCMs’ applications.

Second, the GB method allows the use of flexible loss functions and priors. The uncertainty of the parameters appearing in the loss functions is directly represented in the generalized posteriors produced by the GB method. In other words, the GB method can quantify the uncertainty of attribute mastery estimates. Not only point estimates but also their uncertainty are important for careful decisions in diagnostic evaluation. The GB method thus provides a useful tool for addressing the problems described above, which both parametric and nonparametric methods share. Furthermore, the generalized posterior is easily obtained using a Markov chain Monte Carlo (MCMC) routine, such as the Metropolis–Hastings method. Third, we can control the relative importance of the dataset and the prior via the learning rate parameter. If the quality of the obtained data is questionable, an inference that depends completely on the data may lead to inappropriate decisions. In such cases, the data’s relative importance can be reduced. The learning rate parameter thus enables more flexible inference.

Based on these discussions, we develop a GB method to overcome the limitations of the loss function-based estimation method for DCMs (Ma, de la Torre, & Xu, Reference Ma, de la Torre and Xu2023). The remainder of this article is organized as follows: The second section demonstrates the basic setup of the DCMs and the previous loss function-based estimation method. The third section provides the GB method’s fundamentals and its application to DCMs based on their loss functions. Therein, the MCMC algorithm for a generalized posterior is also discussed. The GB method’s mathematical properties under certain conditions are discussed in the fourth section. The fifth and sixth sections comprise simulation and real data analysis examples of GB inference for DCMs, wherein we compare previous nonparametric estimation methods in a simulation study. Finally, the seventh section serves as the discussion, where the limitations of the GB inference and future directions of DCMs’ estimation methods are discussed.

2 Model setup and previous estimation methods

2.1 Model setup of DCMs

First, we express an individual’s attribute mastery pattern using a vector of length $K,{\boldsymbol{\unicode{x3b1}}}_i\in {\left\{0,1\right\}}^K,$ where $i\in \left\{1,2,\dots, I\right\}$ . The $k$ th element of the attribute mastery pattern vector ${\boldsymbol{\unicode{x3b1}}}_i$ is ${\unicode{x3b1}}_{ik}\in \left\{0,1\right\}$ , where $k\in \left\{1,2,\dots, K\right\}$ ; it takes 1 if individual $i$ masters attribute $k$ , and 0 otherwise. In this study, we assume unconstrained attribute mastery patterns, in which all possible attribute mastery patterns are considered and their number is $L={2}^K$ . Therefore, the $l\in \left\{1,2,\dots, L\right\}$ -th attribute mastery pattern can be written as ${\boldsymbol{\unicode{x3b1}}}_l$ . The set of attribute mastery patterns for all individuals is $\mathcal{A}={\left\{{\boldsymbol{\unicode{x3b1}}}_i\right\}}_{i=1}^I$ . To define the parametric measurement model, we also need to specify the diagnostic relationship between the attributes and the item set.

The diagnostic relationship between the attributes and test items is called the $\boldsymbol{q}$ -vector, ${\boldsymbol{q}}_j\in {\left\{0,1\right\}}^K\setminus \left\{{\mathbf{0}}_K\right\}$ , where $j\in \left\{1,2,\dots, J\right\}$ ; if the $k$ th attribute is required for item $j,{q}_{jk}=1$ ; otherwise ${q}_{jk}=0$ . Additionally, ${\mathbf{0}}_K$ is a vector of length $K$ and all its elements are 0. Here, we assume there is no item with ${\boldsymbol{q}}_j={\mathbf{0}}_K$ . The Q-matrix (Tatsuoka, Reference Tatsuoka1985) is a $J\times K$ matrix defined by ${\left({\boldsymbol{q}}_1^{\top },{\boldsymbol{q}}_2^{\top },\dots, {\boldsymbol{q}}_J^{\top}\right)}^{\top }$ .
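For illustration only (a hypothetical three-item, two-attribute example of our own, not one of the Q-matrices used later in the simulation study), a Q-matrix might be

$$\begin{align*}\mathbf{Q}=\begin{pmatrix}1 & 0\\ 0 & 1\\ 1 & 1\end{pmatrix},\end{align*}$$

where item 1 measures only attribute 1, item 2 measures only attribute 2, and item 3 requires both attributes.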

Parametric DCMs define their measurement models using attribute mastery patterns and a $\boldsymbol{q}$ -vector. For example, one of the most general DCMs, known as the log linear cognitive diagnostic model (LCDM; Henson et al., Reference Henson, Templin and Willse2009), uses an item parameter vector ${\boldsymbol{\unicode{x3bb}}}_j={\left({\unicode{x3bb}}_{j0},{\unicode{x3bb}}_{j1},\dots, {\unicode{x3bb}}_{j12\dots K}\right)}^{\top }$ , and the measurement model of ${X}_{ij}=1$ , which is a conditional response probability of individual $i$ for item $j$ , is:

(1) $$\begin{align}P\left({X}_{ij}=1\mid {\boldsymbol{\unicode{x3bb}}}_j,{\boldsymbol{q}}_j,{\boldsymbol{\unicode{x3b1}}}_i={\boldsymbol{\unicode{x3b1}}}_l\right)=\frac{\exp \left(f\left({\boldsymbol{\unicode{x3bb}}}_j,{\boldsymbol{q}}_j,{\boldsymbol{\unicode{x3b1}}}_i\right)\right)}{1+\exp \left(f\left({\boldsymbol{\unicode{x3bb}}}_j,{\boldsymbol{q}}_j,{\boldsymbol{\unicode{x3b1}}}_i\right)\right)},\end{align}$$

where $f\left({\boldsymbol{\unicode{x3bb}}}_j,{\boldsymbol{q}}_j,{\boldsymbol{\unicode{x3b1}}}_i\right)$ is:

(2) $$\begin{align}f\left({\boldsymbol{\unicode{x3bb}}}_j,{\boldsymbol{q}}_j,{\boldsymbol{\unicode{x3b1}}}_i\right)&=\log \frac{P\left({X}_{ij}=1\mid {\boldsymbol{\unicode{x3bb}}}_j,{\boldsymbol{q}}_j,{\boldsymbol{\unicode{x3b1}}}_i={\boldsymbol{\unicode{x3b1}}}_l\right)}{1-P\left({X}_{ij}=1\mid {\boldsymbol{\unicode{x3bb}}}_j,{\boldsymbol{q}}_j,{\boldsymbol{\unicode{x3b1}}}_i={\boldsymbol{\unicode{x3b1}}}_l\right)}\nonumber\\&={\unicode{x3bb}}_{j0}+ \sum\limits_{k=1}^K{\unicode{x3bb}}_{jk}{q}_{jk}{\unicode{x3b1}}_{ik}+ \sum\limits_{k=1}^K \sum\limits_{k^{\prime }<k}{\unicode{x3bb}}_{jk{k}^{\prime }}{q}_{jk}{q}_{j{k}^{\prime }}{\unicode{x3b1}}_{ik}{\unicode{x3b1}}_{i{k}^{\prime }}+\cdots +{\unicode{x3bb}}_{j12\dots K}{\prod}_{k=1}^K{q}_{jk}{\unicode{x3b1}}_{ik}.\end{align}$$

The LCDM has several parameters. The first is the intercept ${\unicode{x3bb}}_{j0}$ , which determines the baseline correct item response probability: attribute mastery patterns that master none of the attributes required by item $j$ have this baseline response probability. The main effect parameters are ${\unicode{x3bb}}_{j1},{\unicode{x3bb}}_{j2},\dots, {\unicode{x3bb}}_{jK}$ ; each affects the correct item response probability when the corresponding attribute is mastered. The first-order interaction parameter ${\unicode{x3bb}}_{jk{k}^{\prime }}$ is the effect of simultaneously mastering attributes $k$ and ${k}^{\prime }$ . Similarly, the highest-order interaction term is ${\unicode{x3bb}}_{j12\dots K}$ . General cognitive diagnosis models similar to the LCDM have also been proposed in the literature, including the generalized DINA (GDINA) model (de la Torre, Reference de la Torre2011) and the general diagnostic model (GDM; von Davier, Reference von Davier2008).
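To make Equations 1 and 2 concrete, the following minimal Python sketch (our illustration, not the authors’ code; the function name, the dictionary representation of ${\boldsymbol{\unicode{x3bb}}}_j$ , and the numerical values are hypothetical) computes the LCDM correct-response probability for a given attribute mastery pattern.

```python
import itertools
import numpy as np

def lcdm_probability(lambdas, q, alpha):
    """Correct-response probability under the LCDM kernel of Equations 1-2.

    lambdas : dict mapping tuples of attribute indices to effect parameters;
              the empty tuple () holds the intercept lambda_j0.
    q, alpha : 0/1 arrays of length K (q-vector and attribute mastery pattern).
    """
    K = len(q)
    f = lambdas.get((), 0.0)                       # intercept
    measured = [k for k in range(K) if q[k] == 1]  # attributes required by the item
    # main effects and all interaction terms among the measured attributes
    for order in range(1, len(measured) + 1):
        for subset in itertools.combinations(measured, order):
            if all(alpha[k] == 1 for k in subset):
                f += lambdas.get(subset, 0.0)
    return np.exp(f) / (1.0 + np.exp(f))

# Hypothetical item measuring attributes 0 and 1: intercept -2,
# main effects 1.5 each, and a first-order interaction of 1.0.
lambdas = {(): -2.0, (0,): 1.5, (1,): 1.5, (0, 1): 1.0}
q = np.array([1, 1, 0])
print(lcdm_probability(lambdas, q, np.array([1, 1, 0])))  # all required attributes mastered
print(lcdm_probability(lambdas, q, np.array([0, 0, 1])))  # none of the required attributes mastered
```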

As some attributes are not measured by item $j$ , the number of estimated item parameters under LCDM is ${2}^{\sum_k\kern0.1em {q}_{jk}}\le {2}^K$ . Moreover, notably, one-to-one mapping exists between the LCDM item parameters and conditional item response probabilities (Rupp et al., Reference Rupp, Templin and Henson2010). Therefore, it is convenient to use conditional item response probabilities to develop parameter estimation methods. The same strategy was adopted in previous studies (Yamaguchi & Okada, Reference Yamaguchi and Okada2020; Yamaguchi & Templin, Reference Yamaguchi and Templin2022b), where DCMs are a restricted version of latent class models (e.g., Rupp & Templin, Reference Rupp and Templin2008; Xu & Shang, Reference Xu and Shang2018).

Therefore, let the correct item response probability be the parameter ${\unicode{x3b8}}_{j,{\boldsymbol{\unicode{x3b1}}}_l}$ (written ${\unicode{x3b8}}_{jl}$ for brevity where convenient):

(3) $$\begin{align}{\unicode{x3b8}}_{j,{\boldsymbol{\unicode{x3b1}}}_l}=P\left({X}_{ij}=1\mid {\boldsymbol{\unicode{x3bb}}}_j,{\boldsymbol{q}}_j,{\boldsymbol{\unicode{x3b1}}}_i={\boldsymbol{\unicode{x3b1}}}_l\right),\end{align}$$

Additionally, the attribute mastery mixing parameters ${\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_1},{\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_2},\dots, {\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_L}\in \left(0,1\right)$ are defined as ${\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_l}=P\left({\boldsymbol{\unicode{x3b1}}}_i={\boldsymbol{\unicode{x3b1}}}_l\right)$ , satisfying ${\sum}_l\kern0.1em {\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_l}=1$ . From this notation, the complete data likelihood function of the LCDM is:

(4) $$\begin{align}\mathcal{L}\left(\mathcal{A},\Theta, \boldsymbol{\unicode{x3c0}} \mid X\right)=\prod \limits_{i=1}^I\kern0.20em \prod \limits_{j=1}^J\kern0.20em \prod \limits_{l=1}^L\kern0.20em {\left\{{\unicode{x3c0}}_l{\unicode{x3b8}}_{j,{\boldsymbol{\unicode{x3b1}}}_l}^{x_{ij}}{\left(1-{\unicode{x3b8}}_{j,{\boldsymbol{\unicode{x3b1}}}_l}\right)}^{1-{x}_{ij}}\right\}}^{\mathcal{I}\left({\boldsymbol{\unicode{x3b1}}}_i={\boldsymbol{\unicode{x3b1}}}_l\right)},\end{align}$$

where $X={\left\{{x}_{ij}\right\}}_{i,j=1}^{I,J},\Theta ={\left\{{\unicode{x3b8}}_{jl}\right\}}_{j,l=1}^{J,L},\boldsymbol{\unicode{x3c0}} ={\left({\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_1},{\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_2},\dots, {\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_L}\right)}^{\top }$ , and $\mathcal{I}\left(\cdotp \right)$ is an indicator function.

We add some remarks on the correct item response probabilities for item $j$ . First, as mentioned previously, some attribute mastery patterns have the same item response probabilities because of the setting of the $\boldsymbol{q}$ vector. Moreover, some submodels of the LCDM assume fewer parameters than the general LCDM and have parsimonious model forms. The model settings for each item can differ, but we assume that all test items have the same general LCDM form.

Second, the correct item response probabilities for item $j$ exhibit an ordinal relationship: These relationships are known as monotonicity constraints (Xu & Shang, Reference Xu and Shang2018). The formal expression of the monotonicity constraints proposed by Xu and Shang (Reference Xu and Shang2018) is:

(5) $$\begin{align}\underset{\boldsymbol{\unicode{x3b1}} :\boldsymbol{\unicode{x3b1}} \succcurlyeq {\boldsymbol{q}}_j}{\max}\kern0.1em {\unicode{x3b8}}_{j,\boldsymbol{\unicode{x3b1}}}=\underset{\boldsymbol{\unicode{x3b1}} :\boldsymbol{\unicode{x3b1}} \succcurlyeq {\boldsymbol{q}}_j}{\min}\kern0.1em {\unicode{x3b8}}_{j,\boldsymbol{\unicode{x3b1}}}\ge {\unicode{x3b8}}_{j,{\boldsymbol{\unicode{x3b1}}}^{\prime }}\ge {\unicode{x3b8}}_{j,{\mathbf{0}}_K},\end{align}$$

where we write $\boldsymbol{\unicode{x3b1}} \succcurlyeq {\boldsymbol{q}}_j$ if ${\unicode{x3b1}}_k\ge {q}_{jk},\forall k$ ; otherwise, we write $\boldsymbol{\unicode{x3b1}} \nsucccurlyeq {\boldsymbol{q}}_j$ . These constraints imply that the patterns mastering all attributes measured by item $j\left(\boldsymbol{\unicode{x3b1}} :\boldsymbol{\unicode{x3b1}} \succcurlyeq {\boldsymbol{q}}_j\right)$ should have the highest correct item response probability of all patterns. By contrast, patterns mastering none of the required attributes have the lowest correct item response probability, ${\unicode{x3b8}}_{j,{\mathbf{0}}_K}$ . Intermediate patterns, which master some but not all of the attributes required by item $j$ , have correct item response probabilities between these two.
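As a quick illustration of Equation 5, the following Python sketch (our own helper; the function name and tolerance are assumptions, not part of the article) checks whether a vector of correct item response probabilities satisfies the monotonicity constraints for one item.

```python
import numpy as np

def satisfies_monotonicity(theta_j, q_j, patterns, tol=1e-12):
    """Check the monotonicity constraint of Equation 5 for a single item.

    theta_j  : length-L array of correct response probabilities theta_{j,alpha_l}.
    q_j      : length-K 0/1 q-vector of the item.
    patterns : (L, K) array containing all attribute mastery patterns.
    """
    full = np.all(patterns >= q_j, axis=1)                   # patterns with alpha >= q_j
    theta_full = theta_j[full]
    theta_zero = theta_j[np.all(patterns == 0, axis=1)][0]   # the all-zero pattern
    same_top = np.isclose(theta_full, theta_full.max()).all()          # max equals min over full masters
    bounded = (np.all(theta_j <= theta_full.max() + tol)
               and np.all(theta_j >= theta_zero - tol))                # every pattern lies in between
    return same_top and bounded
```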

2.2 Loss function-based parameter estimation

This section introduces the loss function-based parameter estimation method proposed by Ma, de la Torre, and Xu (Reference Ma, de la Torre and Xu2023). First, we describe certain elements of the loss function-based method. In this framework, we introduce the length- $J$ centroid parameter vector ${\boldsymbol{\unicode{x3bc}}}_{{\boldsymbol{\unicode{x3b1}}}_l}={\left({\unicode{x3bc}}_{1,{\boldsymbol{\unicode{x3b1}}}_l},{\unicode{x3bc}}_{2,{\boldsymbol{\unicode{x3b1}}}_l},\dots, {\unicode{x3bc}}_{J,{\boldsymbol{\unicode{x3b1}}}_l}\right)}^{\top}\in {\mathbb{R}}^J$ . Additionally, a penalty term for the mixing parameter ${\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_l}$ is introduced as $h\left({\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_l}\right)\in \mathbb{R}$ . Furthermore, an element-wise loss function taking the item response vector ${\boldsymbol{x}}_i$ and a centroid parameter vector ${\boldsymbol{\unicode{x3bc}}}_{{\boldsymbol{\unicode{x3b1}}}_l}$ is expressed as $\ell \left({\boldsymbol{x}}_i,{\boldsymbol{\unicode{x3bc}}}_{{\boldsymbol{\unicode{x3b1}}}_l}\right)$ ; its codomain is the set of positive real numbers ${\mathbb{R}}^{+}$ . The function $\ell \left({\boldsymbol{x}}_i,{\boldsymbol{\unicode{x3bc}}}_{{\boldsymbol{\unicode{x3b1}}}_l}\right)$ is the individual-level loss function. Therefore, the loss function for the entire dataset is built from the individual-level loss functions:

(6) $$\begin{align}\mathcal{L}\left(\mathcal{A},\boldsymbol{\unicode{x3bc}}, \boldsymbol{\unicode{x3c0}} \right)=\sum \limits_{l=1}^L\kern0.20em \sum \limits_{i:{\boldsymbol{\unicode{x3b1}}}_i={\boldsymbol{\unicode{x3b1}}}_l}\kern0.20em \left\{\ell \left({\boldsymbol{x}}_i,{\boldsymbol{\unicode{x3bc}}}_{{\boldsymbol{\unicode{x3b1}}}_l}\right)+h\left({\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_l}\right)\right\}.\end{align}$$

The second summation takes over the individuals with the attribute mastery pattern ${\boldsymbol{\unicode{x3b1}}}_l$ .

Parameter estimates are obtained to minimize the loss function defined above:

(7) $$\begin{align}\left\{\widehat{\mathcal{A}},\widehat{\boldsymbol{\unicode{x3bc}}},\widehat{\boldsymbol{\unicode{x3c0}}}\right\}=\underset{\mathcal{A},\boldsymbol{\unicode{x3bc}}, \boldsymbol{\unicode{x3c0}}}{\mathrm{argmin}}\;\mathcal{L}\left(\mathcal{A},\boldsymbol{\unicode{x3bc}}, \boldsymbol{\unicode{x3c0}} \right).\end{align}$$

Directly minimizing the above loss function is not easy; therefore, we use an iterative update rule instead. In the estimation algorithm, we first set initial parameters $\left\{{\boldsymbol{\unicode{x3bc}}}^{(0)},{\boldsymbol{\unicode{x3c0}}}^{(0)}\right\}$ . Given the parameter estimates at the $t$ th iteration, $\left\{{\boldsymbol{\unicode{x3bc}}}^{(t)},{\boldsymbol{\unicode{x3c0}}}^{(t)}\right\}$ , the following update steps are repeated:

(8) $$\begin{align}\mathrm{Step}\;1:\left\{{\mathcal{A}}^{\left(t+1\right)}\right\}&=\underset{\mathcal{A}}{\mathrm{argmin}}\;\mathcal{L}\left(\mathcal{A},{\boldsymbol{\unicode{x3bc}}}^{(t)},{\boldsymbol{\unicode{x3c0}}}^{(t)}\right),\\ \mathrm{Step}\;2:\left\{{\boldsymbol{\unicode{x3bc}}}^{\left(t+1\right)},{\boldsymbol{\unicode{x3c0}}}^{\left(t+1\right)}\right\}&=\underset{\boldsymbol{\unicode{x3bc}}, \boldsymbol{\unicode{x3c0}}}{\mathrm{argmin}}\;\mathcal{L}\left({\mathcal{A}}^{\left(t+1\right)},\boldsymbol{\unicode{x3bc}}, \boldsymbol{\unicode{x3c0}} \right).\nonumber\end{align}$$

If the predetermined convergence criterion is satisfied, for example, $\unicode{x3b5} >1-{\sum}_i\kern0.1em \left(\mathcal{I}\left({\boldsymbol{\unicode{x3b1}}}_i^{\left(t+1\right)}={\boldsymbol{\unicode{x3b1}}}_i^{(t)}\right)\right)/ I$ with $0<\unicode{x3b5} <1$ , the update process is stopped, and the parameter estimates become the output: $\left\{\widehat{\mathcal{A}},\widehat{\boldsymbol{\unicode{x3bc}}},\widehat{\boldsymbol{\unicode{x3c0}}}\right\}=\left\{{\mathcal{A}}^{\left(t+1\right)},{\boldsymbol{\unicode{x3bc}}}^{\left(t+1\right)},{\boldsymbol{\unicode{x3c0}}}^{\left(t+1\right)}\right\}$ .
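The update scheme in Equations 7 and 8 can be sketched in Python as follows. This is our simplified illustration of the alternating minimization, not the implementation of Ma, de la Torre, and Xu (2023); the function names, arguments, and convergence defaults are assumptions.

```python
import numpy as np

def fit_loss_based(X, patterns, elementwise_loss, update_centroids,
                   pi_penalty, max_iter=100, eps=0.01, seed=0):
    """Alternating minimization in the spirit of Equations 7-8 (illustrative sketch).

    X                : (I, J) binary item response matrix.
    patterns         : (L, K) array of all candidate attribute mastery patterns.
    elementwise_loss : function (x_i, mu_l) -> nonnegative loss for one person.
    update_centroids : function (X, assign, patterns) -> (L, J) centroid matrix.
    pi_penalty       : function pi_l -> penalty h(pi_l), e.g., lambda p: -np.log(p).
    """
    rng = np.random.default_rng(seed)
    I, L = X.shape[0], patterns.shape[0]
    assign = rng.integers(L, size=I)                 # initial attribute pattern indices
    mu = update_centroids(X, assign, patterns)       # initial centroids mu_{alpha_l}
    pi = np.full(L, 1.0 / L)
    for _ in range(max_iter):
        old = assign.copy()
        # Step 1: reassign each person to the pattern minimizing loss + penalty
        losses = np.array([[elementwise_loss(X[i], mu[l]) + pi_penalty(pi[l])
                            for l in range(L)] for i in range(I)])
        assign = losses.argmin(axis=1)
        # Step 2: update centroids and mixing proportions given the new memberships
        mu = update_centroids(X, assign, patterns)
        pi = (np.bincount(assign, minlength=L) + 1e-12) / I
        if np.mean(assign != old) < eps:             # change rate below epsilon: stop
            break
    return assign, mu, pi
```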

Many previous estimation methods can be viewed as special cases of this general loss function-based framework. The joint maximum likelihood estimation of a parametric DCM is a first example. In the following, we focus on the deterministic inputs, noisy “and” gate (DINA) model (Junker & Sijtsma, Reference Junker and Sijtsma2001; MacReady & Dayton, Reference MacReady and Dayton1977; Maris, Reference Maris1999) as an example because it is well known and considered the most parsimonious DCM. The loss function used to obtain MAP estimates, which is the negative of the sum of the log-likelihood and log-prior density functions, is also presented. Subsequently, the nonparametric classification method (NPC; Chiu & Douglas, Reference Chiu and Douglas2013; Wang & Douglas, Reference Wang and Douglas2015) and the generalized NPC (GNPC; Chiu & Köhn, Reference Chiu and Köhn2019; Chiu et al., Reference Chiu, Sun and Bian2018) are formulated under the above framework. Furthermore, the NPC and GNPC are extended to the GB framework in a later section.

The DINA model is the simplest and most fundamental DCM, which is a special case of the LCDM. The DINA model assumes only the intercept and the highest interaction terms of the LCDM item parameters. Let the subscript set of attributes measured by item $j$ be $\mathcal{K}=\left\{k;{q}_{jk}=1,k=1,2,\dots, K\right\}$ and let the LCDM kernel for the DINA model be reduced to:

(9) $$\begin{align}f\left({\boldsymbol{\unicode{x3bb}}}_j,{\boldsymbol{q}}_j,{\boldsymbol{\unicode{x3b1}}}_i\right)={\lambda}_{j0}+{\lambda}_{j\mathcal{K}}\prod \limits_{k\in \mathcal{K}}\kern0.20em {\unicode{x3b1}}_{ik}.\end{align}$$

In the conventional DINA formulation, the two item response probabilities are represented by the guessing ${g}_j$ and slipping ${s}_j$ parameters:

(10) $$\begin{align}{g}_j=\frac{\exp \left({\unicode{x3bb}}_{j0}\right)}{1+\exp \left({\unicode{x3bb}}_{j0}\right)},\end{align}$$
(11) $$\begin{align}1-{s}_j=\frac{\exp \left({\lambda}_{j0}+{\lambda}_{j\mathcal{K}}\prod \limits_{k\in \mathcal{K}}\kern0.20em {\alpha}_{ik}\right)}{1+\exp \left({\lambda}_{j0}+{\lambda}_{j\mathcal{K}}\prod \limits_{k\in \mathcal{K}}\kern0.20em {\alpha}_{ik}\right)}.\end{align}$$

The guessing parameter ${g}_j$ indicates the chance level of a correct item response for attribute mastery patterns that lack at least one attribute required by item $j$ . The slipping parameter ${s}_j$ is the probability of an incorrect response for patterns that master all attributes required by item $j$ . Both ${g}_j$ and ${s}_j$ can be represented as functions of the ideal responses:

(12) $$\begin{align}{\eta}_j^{DINA}\left({\boldsymbol{\unicode{x3b1}}}_l\right)=\prod \limits_{k=1}^K\kern0.20em {\unicode{x3b1}}_{lk}^{q_{jk}}.\end{align}$$

The ideal response represents the response of an individual who belongs to the $l$ th attribute mastery pattern for item $j$ without errors. Then, ${g}_j$ and ${s}_j$ are represented as conditional probabilities:

(13) $$\begin{align}{g}_j=P\left({X}_j=1\mid {\eta}_j^{DINA}\left({\boldsymbol{\unicode{x3b1}}}_l\right)=0\right),\end{align}$$
(14) $$\begin{align}{s}_j=P\left({X}_j=0\mid {\eta}_j^{DINA}\left({\boldsymbol{\unicode{x3b1}}}_l\right)=1\right).\end{align}$$

Using the item response probabilities, the centroid parameter under the DINA model is:

(15) $$\begin{align}{\unicode{x3b8}}_{jl}={g}_j^{1-{\eta}_j^{DINA}\left({\boldsymbol{\unicode{x3b1}}}_l\right)}{\left(1-{s}_j\right)}^{\eta_j^{DINA}\left({\boldsymbol{\unicode{x3b1}}}_l\right)}.\end{align}$$

Assuming a cross-entropy loss, $\ell \left(x,\unicode{x3b8} \right)=-x\log \unicode{x3b8} -\left(1-x\right)\log \left(1-\unicode{x3b8} \right)$ , the likelihood-based loss function for the DINA model is:

(16) $$\begin{align}\mathcal{L}\left(\mathcal{A},\boldsymbol{\unicode{x3bc}}, \boldsymbol{\unicode{x3c0}} \right)=-\sum \limits_{i=1}^I\kern0.20em \sum \limits_{l=1}^L\kern0.20em \mathcal{I}\left({\boldsymbol{\unicode{x3b1}}}_i={\boldsymbol{\unicode{x3b1}}}_l\right)\left[\sum \limits_{j=1}^J\kern0.20em \left\{{x}_{ij}\log {\unicode{x3b8}}_{jl}+\left(1-{x}_{ij}\right)\log \left(1-{\unicode{x3b8}}_{jl}\right)\right\}+\log {\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_l}\right].\end{align}$$

We assume $h\left({\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_l}\right)=-\log {\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_l}$ . The loss function defined in Equation 16 is equivalent to the negative complete-data log-likelihood function. Therefore, minimizing Equation 16 corresponds to maximizing the likelihood function, and the minimizers $\left\{\hat{\mathcal{A}},\hat{\boldsymbol{\unicode{x3bc}}},\hat{\boldsymbol{\unicode{x3c0}}}\right\}$ can be considered the maximum likelihood estimates.

Subsequently, we examine the NPC and GNPC methods. Following Chiu and Douglas (Reference Chiu and Douglas2013), the loss function in the NPC method is defined by the Hamming distance between the individual item response vector and the ideal response vector:

(17) $$\begin{align}\ell \left({\boldsymbol{x}}_i,{\boldsymbol{\unicode{x3bc}}}_{{\boldsymbol{\unicode{x3b1}}}_l}\right)=\sum \limits_{j=1}^J\kern0.20em \ell \left({x}_{ij},{\unicode{x3bc}}_{j,{\boldsymbol{\unicode{x3b1}}}_l}\right)=\sum \limits_{j=1}^J\kern0.20em \left|{x}_{ij}-{\unicode{x3b7}}_j^{DINA}\left({\boldsymbol{\unicode{x3b1}}}_l\right)\right|.\end{align}$$

In the NPC method, the centroid parameter is the ideal response ${\unicode{x3bc}}_{j,{\boldsymbol{\unicode{x3b1}}}_l}$ . The NPC estimates are obtained to minimize Equation 17 for each individual:

(18) $$\begin{align}{\hat{\boldsymbol{\unicode{x3b1}}}}_i=\underset{{\boldsymbol{\unicode{x3b1}}}_l}{\mathrm{argmin}}\sum \limits_{j=1}^J\kern0.20em \left|{x}_{ij}-{\unicode{x3b7}}_j^{DINA}\left({\boldsymbol{\unicode{x3b1}}}_l\right)\right|,\forall i.\end{align}$$

Clearly, the Hamming distance is a loss function, and the NPC method is a loss function-based estimation method.
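The NPC method of Equations 17 and 18 can be implemented directly. The following Python sketch (our illustration; the function names are ours) computes the DINA-type ideal responses of Equation 12 and classifies each person to the pattern minimizing the Hamming distance.

```python
import numpy as np

def dina_ideal_responses(patterns, Q):
    """Ideal responses eta_j^DINA(alpha_l) of Equation 12 for all patterns."""
    # pattern l has ideal response 1 on item j iff it masters every required attribute
    return np.all(patterns[:, None, :] >= Q[None, :, :], axis=2).astype(int)      # (L, J)

def npc_classify(X, patterns, Q):
    """NPC estimates of Equation 18: minimize the Hamming distance per person."""
    eta = dina_ideal_responses(patterns, Q)                                        # (L, J)
    hamming = np.abs(X[:, None, :] - eta[None, :, :]).sum(axis=2)                  # (I, L)
    return hamming.argmin(axis=1)          # index of the estimated pattern alpha-hat_i
```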

As introduced in Chiu et al. (Reference Chiu, Sun and Bian2018), the GNPC is a type of generalization that employs DINA-type and deterministic inputs noisy, “or” gate (DINO; Templin & Henson, Reference Templin and Henson2006)-type ideal responses to define a generalized ideal response. The DINO-type ideal response is

(19) $$\begin{align}{\unicode{x3b7}}_j^{DINO}\left({\boldsymbol{\unicode{x3b1}}}_l\right)=1-\prod \limits_{k=1}^K\kern0.20em {\left(1-{\unicode{x3b1}}_{lk}\right)}^{q_{jk}},\end{align}$$

and ${\unicode{x3b7}}_j^{DINO}\left({\boldsymbol{\unicode{x3b1}}}_l\right)$ becomes one if pattern $l$ masters at least one attribute required for item $j$ ; otherwise, it becomes 0. The generalized ideal response is then defined as

(20) $$\begin{align}{\unicode{x3b7}}_j^{(w)}\left({\boldsymbol{\unicode{x3b1}}}_l\right)={w}_{jl}{\unicode{x3b7}}_j^{DINA}\left({\boldsymbol{\unicode{x3b1}}}_l\right)+\left(1-{w}_{jl}\right){\unicode{x3b7}}_j^{DINO}\left({\boldsymbol{\unicode{x3b1}}}_l\right),\end{align}$$

where ${w}_{jl}\in \left[0,1\right]$ is a weight parameter that determines an item’s tendency. If the item is more like DINA or conjunctive, ${w}_{jl}$ is close to one. By contrast, a ${w}_{jl}$ near zero means that the item is DINO-like or disjunctive in nature. The GNPC assumes a Euclidean distance for its loss function

(21) $$\begin{align}d\left({\boldsymbol{x}}_j,{\boldsymbol{\unicode{x3b7}}}^{(w)}\left({\boldsymbol{\unicode{x3b1}}}_l\right)\right)=\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left({\boldsymbol{\unicode{x3b1}}}_i={\boldsymbol{\unicode{x3b1}}}_l\right){\left({x}_{ij}-{\unicode{x3b7}}_j^{(w)}\left({\boldsymbol{\unicode{x3b1}}}_l\right)\right)}^2,\end{align}$$

where ${\boldsymbol{\unicode{x3b7}}}^{(w)}\left({\boldsymbol{\unicode{x3b1}}}_l\right)={\left({\unicode{x3b7}}_1^{(w)}\left({\boldsymbol{\unicode{x3b1}}}_l\right),{\unicode{x3b7}}_2^{(w)}\left({\boldsymbol{\unicode{x3b1}}}_l\right),\dots, {\unicode{x3b7}}_J^{(w)}\left({\boldsymbol{\unicode{x3b1}}}_l\right)\right)}^{\top }$ . The weight parameter is estimated via ${\widehat{w}}_{jl}=1-{\sum}_{i=1}^I\kern0.1em \mathcal{I}\left({\boldsymbol{\unicode{x3b1}}}_i={\boldsymbol{\unicode{x3b1}}}_l\right){x}_{ij}/{\sum}_{i=1}^I\kern0.1em \mathcal{I}\left({\boldsymbol{\unicode{x3b1}}}_i={\boldsymbol{\unicode{x3b1}}}_l\right)$ . The loss function of the GNPC is

(22) $$\begin{align}\mathcal{L}\left(\left\{\mathcal{A},W\right\}\right)=\sum \limits_{j=1}^J\kern0.20em \sum \limits_{l=1}^L\kern0.20em d\left({\boldsymbol{x}}_j,{\boldsymbol{\unicode{x3b7}}}^{(w)}\left({\boldsymbol{\unicode{x3b1}}}_l\right)\right)=\sum \limits_{j=1}^J\kern0.20em \sum \limits_{l=1}^L\kern0.20em \sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left({\boldsymbol{\unicode{x3b1}}}_i={\boldsymbol{\unicode{x3b1}}}_l\right){\left({x}_{ij}-{\unicode{x3b7}}_j^{(w)}\left({\boldsymbol{\unicode{x3b1}}}_l\right)\right)}^2,\end{align}$$

where $W={\left\{{w}_{jl}\right\}}_{j,l=1}^{J,L}$ . The GNPC requires iterative updates of weight ${w}_{jl}$ and attribute mastery patterns. The detailed update rule is described in Chiu et al. (Reference Chiu, Sun and Bian2018). Note that if ${\unicode{x3b7}}_j^{DINA}$ and ${\unicode{x3b7}}_j^{DINO}$ are not distinguished for some items and attribute patterns, the weight value is fixed to a value close to zero or one. See Chiu et al. (Reference Chiu, Sun and Bian2018) for a detailed discussion.
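The following Python sketch (ours, not the authors’ code) computes the DINO-type ideal responses of Equation 19, the weight estimates $\hat{w}_{jl}$ given the current memberships, and the generalized ideal responses of Equation 20; the special handling of items for which the DINA- and DINO-type ideal responses coincide is omitted here.

```python
import numpy as np

def gnpc_weights_and_ideal(X, assign, patterns, Q):
    """Weight estimates w_hat_jl and generalized ideal responses (Equations 19-20).

    X        : (I, J) binary responses; assign : length-I pattern indices.
    patterns : (L, K) attribute mastery patterns; Q : (J, K) Q-matrix.
    """
    L, J = patterns.shape[0], Q.shape[0]
    eta_dina = np.all(patterns[:, None, :] >= Q[None, :, :], axis=2).astype(float)  # (L, J)
    eta_dino = (patterns[:, None, :] * Q[None, :, :]).max(axis=2).astype(float)     # (L, J)
    W = np.full((J, L), 0.5)                       # placeholder for patterns with no members
    for l in range(L):
        members = assign == l
        if members.any():
            W[:, l] = 1.0 - X[members].mean(axis=0)      # w_hat_jl from the formula above
    eta_w = W.T * eta_dina + (1.0 - W.T) * eta_dino      # (L, J) generalized ideal responses
    return W, eta_w
```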

As demonstrated above, parametric and nonparametric estimation methods can be treated in a unified loss function-based framework (C. Ma, de la Torre, & Xu, Reference Ma, de la Torre and Xu2023). However, these loss function-based approaches usually provide only point estimates, and uncertainty quantification of the parameter estimates has received little attention. Furthermore, different specifications of the measurement model can yield significantly different attribute mastery patterns (e.g., Li et al., Reference Li, Hunter and Lei2016), and assessing all possible measurement models for all test items may be difficult. The GNPC is a promising estimation method that can be used in varied situations, even when the measurement model is unknown. However, prior knowledge of the weight parameters in the GNPC is often not considered. These problems can be solved using the GB method introduced in the following section.

3 GB method for DCMs

3.1 Construction of the generalized posterior

The GB method is a decision-theoretic framework for belief updating under model misspecification (Bissiri et al., Reference Bissiri, Holmes and Walker2016). In other words, the assumed model may not accurately represent the true data-generating process, or the relationship between the model parameters and data may not be describable via the assumed model; this is known as the $\mathcal{M}$ -open situation (Bissiri et al., Reference Bissiri, Holmes and Walker2016, p. 1111). The GB method is a coherent belief-update procedure that uses a loss function even in the $\mathcal{M}$ -open situation. Thus, the GB method extends the applicability of typical Bayesian methods, which require a likelihood function.

Let datasets and parameter sets be $\boldsymbol{y}$ and $\boldsymbol{\Theta}$ , respectively. Additionally, the loss function and prior distribution are $\ell \left(\boldsymbol{y};\boldsymbol{\Theta} \right)$ and $p\left(\boldsymbol{\Theta} \right)$ . Then, the generalized posterior of the parameter $p\left(\boldsymbol{\Theta} \mid \boldsymbol{y}\right)$ is:

(23) $$\begin{align}p\left(\boldsymbol{\Theta} \mid \boldsymbol{y}\right)\propto \exp \left(-\unicode{x3c9} \ell \left(\boldsymbol{y};\boldsymbol{\Theta} \right)\right)p\left(\boldsymbol{\Theta} \right),\end{align}$$

where $\unicode{x3c9}$ is called the learning rate, a tuning parameter that controls a dataset’s importance. Methods for determining the learning rate are still being studied (Wu & Martin, Reference Wu and Martin2023); notably, no standard has been established thus far. The generalized posterior function is the result of updating the prior distribution based on the loss function. If we select a negative log-likelihood function for the loss function and $\unicode{x3c9} =1$ , the generalized posterior becomes the usual Bayesian posterior function.

Bissiri et al. (Reference Bissiri, Holmes and Walker2016, pp. 1106–1107) discussed some of the validity requirements for loss functions. First, the minimizer of the loss function must exist. Second, the loss function must satisfy the following condition:

(24) $$\begin{align}0<\int \exp \left(-\ell \left(\boldsymbol{y};\boldsymbol{\Theta} \right)\right)p\left(\boldsymbol{\Theta} \right)d\boldsymbol{\Theta} <\infty .\end{align}$$

The major loss functions considered in this study, such as the Hamming distance, the Euclidean distance, and the cross-entropy loss, satisfy the above integral condition. Additionally, Bissiri et al. (Reference Bissiri, Holmes and Walker2016, p. 1107) identified natural assumptions for deriving a generalized posterior from a loss function. We also note that, to employ the GB method, we only need to construct a loss function for the parameters of interest given the data. In this article, the GB method employs a loss function for the attribute mastery patterns based on the GNPC, namely the squared Euclidean distance; it satisfies the above conditions and is therefore valid.

3.2 General form of the GB method for DCMs

The general form of the GB method for DCMs can be expressed using Equations 6 and 23:

(25) $$\begin{align}p\left(\left\{\mathcal{A},\boldsymbol{\unicode{x3bc}}, \boldsymbol{\unicode{x3c0}} \right\}\mid X\right)& \propto \exp \left(-\unicode{x3c9} \left\{\mathcal{L}\left(\left\{\mathcal{A},\boldsymbol{\unicode{x3bc}}, \boldsymbol{\unicode{x3c0}} \right\}\right)\right\}\right)p\left(\boldsymbol{\unicode{x3bc}} \right)p\left(\boldsymbol{\unicode{x3c0}} \right),\nonumber\\&\propto \exp \left(-\unicode{x3c9} \sum\limits_{l=1}^L\sum\limits_{i:{\boldsymbol{\unicode{x3b1}}}_i={\boldsymbol{\unicode{x3b1}}}_l}\left\{\ell \left({\boldsymbol{x}}_i,{\boldsymbol{\unicode{x3bc}}}_{{\boldsymbol{\unicode{x3b1}}}_l}\right)+h\left({\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_l}\right)\right\}\right)p\left(\boldsymbol{\unicode{x3bc}} \right)p\left(\boldsymbol{\unicode{x3c0}} \right).\end{align}$$

The penalty term is $h\left({\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_l}\right)=-\log {\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_l}$ .

Using the GNPC loss function defined in Equation 22 and adding a penalty term for the mixing parameter $\boldsymbol{\unicode{x3c0}}$ , the generalized posterior is:

(26) $$\begin{align}p\left(\left\{\mathcal{A},W,\boldsymbol{\unicode{x3c0}} \right\}\mid X\right)\propto \exp \left(-\unicode{x3c9} \sum\limits_{l=1}^L\sum\limits_{i=1}^I\mathcal{I}\left({\boldsymbol{\unicode{x3b1}}}_i={\boldsymbol{\unicode{x3b1}}}_l\right)\left\{\left[\sum\limits_{j=1}^J\kern0.1em {\left({x}_{ij}-{\unicode{x3b7}}_j^{(w)}\left({\boldsymbol{\unicode{x3b1}}}_l\right)\right)}^2\right]-\log {\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_l}\right\}\right)p(W)p\left(\boldsymbol{\unicode{x3c0}} \right).\end{align}$$

Notably, we treat the weight matrix $W$ as a parameter and assign it a prior instead of the centroid parameter $\boldsymbol{\unicode{x3bc}}$ , because the centroid parameter ${\boldsymbol{\unicode{x3b7}}}^{(w)}\left({\boldsymbol{\unicode{x3b1}}}_l\right)$ is determined by the two ideal responses and the weight parameters; this parameterization is therefore natural. The priors for the mixing and weight parameters are assumed to be Dirichlet and Beta distributions, respectively:

(27) $$\begin{align}p\left(\boldsymbol{\unicode{x3c0}} \right)\propto \prod \limits_{l=1}^L{\unicode{x3c0}}_l^{\unicode{x3b4}_l^0-1},\end{align}$$
(28) $$\begin{align}p(W)\propto \prod \limits_{j=1}^J\kern0.20em \prod \limits_{l=1}^L\kern0.20em {w}_{jl}^{a_{jl}^0-1}{\left(1-{w}_{jl}\right)}^{b_{jl}^0-1},\end{align}$$

where ${\unicode{x3b4}}_1^0,{\unicode{x3b4}}_2^0,\dots, {\unicode{x3b4}}_L^0>0$ and ${a}_{jl}^0,{b}_{jl}^0>0.$

The posterior was numerically obtained using MCMC techniques, such as Metropolis–Hastings within the Gibbs sampling method, or MCMC software, such as JAGS (Plummer, Reference Plummer2003) or Stan (Carpenter et al., Reference Carpenter, Gelman, Hoffman, Lee, Goodrich, Betancourt, Brubaker, Guo, Li and Riddell2017). The conditional distribution of ${\boldsymbol{\unicode{x3b1}}}_i$ is categorical:

(29) $$\begin{align}p\left({\boldsymbol{\unicode{x3b1}}}_i\mid {\boldsymbol{x}}_i,W,\boldsymbol{\unicode{x3c0}} \right)\propto \prod \limits_{l=1}^L\kern0.20em {r}_{il}^{\mathcal{I}\left({\boldsymbol{\unicode{x3b1}}}_i={\boldsymbol{\unicode{x3b1}}}_l\right)},\quad {r}_{il}=\frac{{\unicode{x3c1}}_{il}}{\sum_l{\unicode{x3c1}}_{il}},\quad {\unicode{x3c1}}_{il}=\exp \left(-\unicode{x3c9} \left\{\left[\sum \limits_{j=1}^J\kern0.20em {\left({x}_{ij}-{\unicode{x3b7}}_j^{(w)}\left({\boldsymbol{\unicode{x3b1}}}_l\right)\right)}^2\right]-\log {\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_l}\right\}\right).\end{align}$$

The conditional distribution of the mixing parameters was a Dirichlet distribution:

(30) $$\begin{align}p\left(\boldsymbol{\unicode{x3c0}} \mid X\right)\propto \prod \limits_{l=1}^L{\unicode{x3c0}}_l^{\unicode{x3b4}_l^{\ast }-1},\quad {\unicode{x3b4}}_l^{\ast}=\unicode{x3c9} \sum\limits_{i=1}^I\mathcal{I}\left({\boldsymbol{\unicode{x3b1}}}_i={\boldsymbol{\unicode{x3b1}}}_l\right)+{\unicode{x3b4}}_l^0.\end{align}$$

The conditional distribution of the weight parameters is not available in a simple closed form; therefore, their MCMC update is performed using the Metropolis–Hastings method. The candidate is generated by a random walk using a uniform distribution: ${w}_{jl}^{\left(\mathrm{cand}\right)}={w}_{jl}^{\left(\mathrm{now}\right)}+u,u\sim \mathrm{Unif}\left(-0.05,0.05\right)$ . Using the above distributions and updating rules, the generalized posterior is numerically approximated by MCMC as follows. The mixing and weight parameters are initialized as $\left\{{\boldsymbol{\unicode{x3c0}}}^{(0)},{W}^{(0)}\right\}$ , and the hyperparameters are set, for example, to ${\unicode{x3b4}}_l^0=1,\forall l$ and ${a}_{jl}^0=2,{b}_{jl}^0=1,\forall j,l$ . Then, at the $t$ th MCMC iteration $\left(t=1,2,\dots, T\in \mathbb{N}\right)$ , ${\boldsymbol{\unicode{x3b1}}}_i^{\left(t+1\right)}$ is generated from the categorical distribution in Equation 29 using the $t$ th MCMC samples $\left\{{\boldsymbol{\unicode{x3c0}}}^{(t)},{W}^{(t)}\right\}$ . The $\left(t+1\right)$ th MCMC sample of the mixing parameter is generated from the Dirichlet distribution in Equation 30 using ${\mathcal{A}}^{\left(t+1\right)}$ , and ${W}^{\left(t+1\right)}$ is obtained using the Metropolis–Hastings method.

Under this hyperparameter setting, the Dirichlet prior for the mixing parameters becomes a uniform distribution. This represents a scenario in which we have no information about the population attribute mastery ratios; in this sense, the prior on the attribute mastery patterns is almost noninformative. The mean and SD of the prior for the weight parameters are 0.667 and 0.236, respectively, and the interval $\left[0.158,0.987\right]$ covers $95\%$ of the prior mass. Under this setting, the data analyst expects the items in the test to have a slightly conjunctive nature, meaning that the items behave more like the DINA model than the DINO model. However, this expectation is not particularly strong because the interval covering $95\%$ of the prior mass is wide. This interpretation indicates that the prior conveys some, but not strong, information about the weight parameters.
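The following Python sketch summarizes the sampler described in this subsection: a Gibbs step for the attribute mastery patterns (Equation 29), a Gibbs step for the mixing proportions (Equation 30), and a random-walk Metropolis–Hastings step for the weights. It is our simplified illustration, not the authors’ OSF code; the block update of the whole weight matrix, the function names, and the scaling of the mixing-parameter term by $\unicode{x3c9}$ (our reading, chosen for consistency with Equation 30) are assumptions.

```python
import numpy as np

def gb_gnpc_mcmc(X, patterns, Q, omega=1.0, n_iter=1000, burn_in=500,
                 delta0=1.0, a0=2.0, b0=1.0, step=0.05, seed=1):
    """MCMC sketch for the GBGNPC generalized posterior of Equation 26."""
    rng = np.random.default_rng(seed)
    I, J = X.shape
    L = patterns.shape[0]
    eta_dina = np.all(patterns[:, None, :] >= Q[None, :, :], axis=2).astype(float)  # (L, J)
    eta_dino = (patterns[:, None, :] * Q[None, :, :]).max(axis=2).astype(float)     # (L, J)

    def sq_loss(W):          # sum_j (x_ij - eta_j^(w)(alpha_l))^2 for every (i, l)
        eta_w = W.T * eta_dina + (1.0 - W.T) * eta_dino
        return ((X[:, None, :] - eta_w[None, :, :]) ** 2).sum(axis=2)

    def log_weight_target(W, assign):      # generalized-posterior terms involving W
        if np.any(W <= 0.0) or np.any(W >= 1.0):
            return -np.inf                 # proposals outside (0, 1) are rejected
        data = -omega * sq_loss(W)[np.arange(I), assign].sum()
        prior = ((a0 - 1) * np.log(W) + (b0 - 1) * np.log(1.0 - W)).sum()  # Beta(a0, b0)
        return data + prior

    W = rng.uniform(0.2, 0.8, size=(J, L))
    pi = np.full(L, 1.0 / L)
    assign = rng.integers(L, size=I)
    kept = []
    for t in range(n_iter):
        # 1. attribute mastery patterns: categorical full conditional (Equation 29)
        logit = -omega * sq_loss(W) + omega * np.log(pi)[None, :]
        prob = np.exp(logit - logit.max(axis=1, keepdims=True))
        prob /= prob.sum(axis=1, keepdims=True)
        assign = np.array([rng.choice(L, p=prob[i]) for i in range(I)])
        # 2. mixing proportions: Dirichlet full conditional (Equation 30)
        counts = np.bincount(assign, minlength=L)
        pi = rng.dirichlet(omega * counts + delta0)
        # 3. weights: random-walk Metropolis-Hastings with a uniform proposal
        W_cand = W + rng.uniform(-step, step, size=(J, L))
        if np.log(rng.uniform()) < log_weight_target(W_cand, assign) - log_weight_target(W, assign):
            W = W_cand
        if t >= burn_in:
            kept.append(assign.copy())
    draws = np.array(kept)                                   # (n_kept, I)
    # posterior probability of each attribute mastery pattern for each person
    return np.stack([(draws == l).mean(axis=0) for l in range(L)], axis=1)
```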

4 Mathematical properties of the proposed method: Consistency of the MAP estimators

First, we formally introduce the estimators under the GB framework and subsequently discuss their statistical behaviors under certain conditions. The appendix provides the full proofs. In this work, we assume that the item responses were generated from the Bernoulli distribution with parameter $\Theta$ defined by Equation 3; the attribute mastery patterns were generated from a categorical distribution with a mixing parameter $\boldsymbol{\unicode{x3c0}}$ . Although several alternatives exist, MAP estimation provides a relatively natural and simple choice. Furthermore, MAP estimators of the GB method $\left(\widehat{\mathcal{A}},\widehat{\Theta},\widehat{\boldsymbol{\unicode{x3c0}}}\right)$ are estimators of the true parameters $\left({\mathcal{A}}^0,{\Theta}^0,{\boldsymbol{\unicode{x3c0}}}^0\right)$ in the data-generating process. These are obtained by minimizing the loss function of $\left(\mathcal{A},\Theta, \boldsymbol{\unicode{x3c0}} \right)$ under the constraint imposed by the Q-matrix, as follows:

(31) $$\begin{align}\mathcal{L}\left(\mathcal{A},\Theta, \boldsymbol{\unicode{x3c0}} \mid X\right)=\sum \limits_{i=1}^I\kern0.20em \left(\sum \limits_{j=1}^J\kern0.20em \ell \left({X}_{ij},{\unicode{x3b8}}_{j,{\boldsymbol{\unicode{x3b1}}}_i}\right)+h\left({\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_i}\right)\right)+\sum \limits_{j,l}\kern0.20em \log {f}_{j,{\boldsymbol{\unicode{x3b1}}}_l}\left({\unicode{x3b8}}_{j,{\boldsymbol{\unicode{x3b1}}}_l}\right)+\sum \limits_{l=1}^L\kern0.20em \log {g}_{{\boldsymbol{\unicode{x3b1}}}_l}\left({\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_l}\right),\end{align}$$

where $h\left(\cdotp \right)$ is a continuous nonincreasing regularization function of the proportion parameters ${\unicode{x3c0}}_{\boldsymbol{\unicode{x3b1}}}$ , often taken as $h\left(\unicode{x3c0} \right)=-\log \unicode{x3c0}$ ; ${f}_{j,\boldsymbol{\unicode{x3b1}}}$ and ${g}_{\boldsymbol{\unicode{x3b1}}}$ are the prior density functions of ${\unicode{x3b8}}_{j,\boldsymbol{\unicode{x3b1}}}$ and ${\unicode{x3c0}}_{\boldsymbol{\unicode{x3b1}}}$ , respectively. Note that we consider a model sequence indexed by $\left(I,J\right)$ , where both $I$ and $J$ tend to infinity, while $K$ is held constant.

Several regularity conditions are required to ensure the consistency of MAP estimators. The first assumption is as follows.

Assumption 1. There exists ${\unicode{x3b4}}_1,{\unicode{x3b4}}_2>0$ such that

$$\begin{align*}\underset{1\le j\le J}{\min}\kern0.1em \left\{\underset{{\boldsymbol{\unicode{x3b1}}}_l\circ {\boldsymbol{q}}_j^0\ne {\boldsymbol{\unicode{x3b1}}}_{l^{\prime }}\circ {\boldsymbol{q}}_j^0}{\min}\kern0.1em {\left({\unicode{x3b8}}_{j,{\boldsymbol{\unicode{x3b1}}}_l}^0-{\unicode{x3b8}}_{j,{\boldsymbol{\unicode{x3b1}}}_{l^{\prime}}}^0\right)}^2\right\}\ge {\unicode{x3b4}}_1,\end{align*}$$

and ${\unicode{x3b4}}_2\le {\min}_{j,\unicode{x3b1}}\kern0.1em {\unicode{x3b8}}_{j,\unicode{x3b1}}^0<{\max}_{j,\unicode{x3b1}}\kern0.1em {\unicode{x3b8}}_{j,\unicode{x3b1}}^0\le 1-{\unicode{x3b4}}_2.$

The first condition in Assumption 1 serves as an identification condition for local latent classes at each item level. The gap ${\unicode{x3b4}}_1$ measures the separation between the latent classes, thereby quantifying the strength of the signal. The second condition in Assumption 1 keeps the true parameters away from the boundaries of the parameter space to prevent unusual behavior of the element-wise loss.

Assumption 2 pertains to the discrete structures of $\mathbf{Q}$ and is expressed as the following.

Assumption 2. All proportion parameters ${\unicode{x3c0}}_{\boldsymbol{\unicode{x3b1}}}$ are strictly greater than zero, and there exists a sequence $\left\{{\unicode{x3b4}}_J\right\}\subset \left(0,\infty \right)$ such that

(32) $$\begin{align}\underset{1\le k\le K}{\min}\kern0.1em \frac{1}{J}\sum \limits_{j=1}^J\kern0.20em \mathcal{I}\left\{{\boldsymbol{q}}_j^0={\boldsymbol{e}}_k\right\}\ge {\unicode{x3b4}}_J.\end{align}$$

This assumption requires that $\mathbf{Q}$ include an increasing number of identity submatrices ${\mathbf{I}}_K$ as $J$ grows. Notably, by attaching the subscript $J$ to the lower bound (32) in Assumption 2, we allow it to decrease to zero as $J$ approaches infinity. As the following theorems show, if the rate at which ${\unicode{x3b4}}_J$ decreases satisfies certain mild requirements, the consistency of $\left(\widehat{\mathcal{A}},\widehat{\Theta}\right)$ can be ensured.

The subsequent assumption concerns the element-wise loss function $\ell$ .

Assumption 3. The loss function $\ell \left(X,\unicode{x3b8} \right)$ is twice continuously differentiable in $\unicode{x3b8}$ on $\left(0,1\right)$ and $\exists {b}_U>{b}_L>0$ such that ${b}_L\le {\partial}_{\unicode{x3b8}}^2\ell \left(X,\unicode{x3b8} \right)\le {b}_U$ for $\unicode{x3b8}$ in a compact subset of $\left(0,1\right)$ . The total loss (31) is minimized at the class means given the subjects’ memberships, that is, ${\widehat{\unicode{x3b8}}}_{j,\boldsymbol{\unicode{x3b1}}}={\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{\widehat{\boldsymbol{\unicode{x3b1}}}}_i=\boldsymbol{\unicode{x3b1}} \right\}{X}_{ij}/{\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{\widehat{\boldsymbol{\unicode{x3b1}}}}_i=\boldsymbol{\unicode{x3b1}} \right\}$ .

Assumption 3 imposes smoothness conditions on the element-wise loss function, rendering it convex. The upper bound on the second derivative is needed to control the remainder term in the expansion of the first-order condition, and the lower bound allows us to quantify the estimator drift caused by the priors. Together with the sample-average condition, we can verify that both the ${\ell}^2$ and the cross-entropy loss functions satisfy Assumption 3.
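For concreteness, the following is our own verification sketch (an addition, not part of the original text) of the second-derivative bounds for the two losses mentioned above, with $X\in \left\{0,1\right\}$ and $\unicode{x3b8}$ restricted to a compact subset $\left[\unicode{x3b4} ,1-\unicode{x3b4} \right]\subset \left(0,1\right)$ :

$$\begin{align*}\ell \left(X,\unicode{x3b8} \right)&={\left(X-\unicode{x3b8} \right)}^2 &&\Rightarrow\quad {\partial}_{\unicode{x3b8}}^2\ell \left(X,\unicode{x3b8} \right)=2,\\ \ell \left(X,\unicode{x3b8} \right)&=-X\log \unicode{x3b8} -\left(1-X\right)\log \left(1-\unicode{x3b8} \right) &&\Rightarrow\quad {\partial}_{\unicode{x3b8}}^2\ell \left(X,\unicode{x3b8} \right)=\frac{X}{{\unicode{x3b8}}^2}+\frac{1-X}{{\left(1-\unicode{x3b8} \right)}^2}\in \left[1,{\unicode{x3b4}}^{-2}\right],\end{align*}$$

so both losses admit bounds ${b}_L$ and ${b}_U$ of the form required in Assumption 3.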

Assumption 4 states that the true parameters minimize the element-wise loss functions and quantifies the deviation when $\unicode{x3b8}$ is not the true parameter. This assumption is expressed as follows:

Assumption 4. There exist constants $\unicode{x3b7} \ge 2,c>0$ such that

(33) $$\begin{align}\mathbb{E}\left[\ell \left({X}_{ij},\unicode{x3b8} \right)\right]-\mathbb{E}\left[\ell \left({X}_{ij},{\unicode{x3b8}}_{j,{\unicode{x3b1}}_i^0}^0\right)\right]\ge c{\left|\unicode{x3b8} -{\unicode{x3b8}}_{j,{\unicode{x3b1}}_i^0}^0\right|}^{\unicode{x3b7}}.\end{align}$$

Assumption 4 holds for both the ${\ell}^2$ loss and the cross-entropy loss.
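As an example (our own calculation, not stated in the original text), for the ${\ell}^2$ loss with ${X}_{ij}\sim \mathrm{Bernoulli}\left({\unicode{x3b8}}_{j,{\boldsymbol{\unicode{x3b1}}}_i^0}^0\right)$ ,

$$\begin{align*}\mathbb{E}\left[{\left({X}_{ij}-\unicode{x3b8} \right)}^2\right]-\mathbb{E}\left[{\left({X}_{ij}-{\unicode{x3b8}}_{j,{\boldsymbol{\unicode{x3b1}}}_i^0}^0\right)}^2\right]={\left(\unicode{x3b8} -{\unicode{x3b8}}_{j,{\boldsymbol{\unicode{x3b1}}}_i^0}^0\right)}^2,\end{align*}$$

so Assumption 4 holds with $\unicode{x3b7} =2$ and $c=1$ .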

Assumption 5 is a technical assumption that allows us to control the effects of prior distributions on the estimators.

Assumption 5. $h\left(\cdotp \right)$ in (31) is a continuous nonincreasing function of the proportion parameters, and $C>c>0$ exists such that for any $j$ and $\boldsymbol{\unicode{x3b1}}, C>{f}_{j,\unicode{x3b1}},{g}_{\unicode{x3b1}}>c$ on a compact parameter subspace of $\left(0,1\right)$ .

We can verify that the Dirichlet and Beta distributions satisfy this assumption.

Under the aforementioned regularity conditions, we demonstrate the consistency properties of the GB method with constraints for different attribute mastery patterns ${\boldsymbol{\unicode{x3b1}}}_l$ and ${\boldsymbol{\unicode{x3b1}}}_{l^{\prime }},l\ne {l}^{\prime }$ (Ma, de la Torre, & Xu, Reference Ma, de la Torre and Xu2023; Xu, Reference Xu2017):

(34) $$\begin{align}\left({\boldsymbol{\unicode{x3b1}}}_l\circ {\boldsymbol{q}}_j={\boldsymbol{\unicode{x3b1}}}_{l^{\prime }}\circ {\boldsymbol{q}}_j\right)\Longrightarrow \left({\unicode{x3b8}}_{j,{\boldsymbol{\unicode{x3b1}}}_l}={\unicode{x3b8}}_{j,{\boldsymbol{\unicode{x3b1}}}_{l^{\prime }}}\right),\end{align}$$

where $\boldsymbol{\unicode{x3b1}} \circ {\boldsymbol{q}}_j=\left({\unicode{x3b1}}_1\cdotp {q}_{j1},\dots, {\unicode{x3b1}}_K\cdotp {q}_{jK}\right)$ denotes the element-wise product of the binary vectors $\boldsymbol{\unicode{x3b1}}$ and ${\boldsymbol{q}}_j$ . This implies that the item response parameter ${\unicode{x3b8}}_{j,\boldsymbol{\unicode{x3b1}}}$ depends only on whether the attribute mastery pattern $\boldsymbol{\unicode{x3b1}}$ contains the required attributes ${\mathcal{K}}_j:= \left\{k\in \left[K\right];{q}_{jk}=1\right\}$ for item $j$ .

Based on the above five assumptions, we can derive consistent results for the GB method. The following main theorem first validates the clustering consistency of the GB method under the constraint (34), providing a bound for its convergence rate in recovering the attribute mastery patterns.

Theorem 1 (Clustering Consistency). Consider $\left(\widehat{\mathcal{A}},\widehat{\Theta},\widehat{\boldsymbol{\unicode{x3c0}}}\right)=\arg {\min}_{\left(\mathcal{A},\Theta, \boldsymbol{\unicode{x3c0}} \right)}\kern0.1em \mathcal{L}\left(\mathcal{A},\Theta, \boldsymbol{\unicode{x3c0}} \mid X\right)$ under the constraint (34). When $I,J\to \infty$ jointly, suppose $\sqrt{J}=O\left({I}^{1-c}\right)$ for some small constant $c\in \left(0,1\right)$ . Under Assumptions 1 to 5, the clustering error rate is:

(35) $$\begin{align}\frac{1}{I}\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{\widehat{\boldsymbol{\unicode{x3b1}}}}_i\ne {\boldsymbol{\unicode{x3b1}}}_i^0\right\}={o}_p\left(\frac{{\left(\log J\right)}^{\tilde{\unicode{x3b5}}/\unicode{x3b7}}}{\unicode{x3b4}_J{(J)}^{1/\unicode{x3b7}}}\right),\end{align}$$

where $\tilde{\unicode{x3b5}}>0$ is a small positive constant.

Theorem 1 bounds the error of the estimator $\hat{\mathcal{A}}$ , which establishes the clustering consistency of the MAP estimators of the GB method, allowing the rate ${\unicode{x3b4}}_J$ to go to zero. Notably, the scaling condition only assumes that $J$ goes to infinity jointly with $I$ , but at a slower rate.

The following result demonstrates that the MAP estimator of the item parameters can be uniformly estimated consistently as $I,J\to \infty$ :

Theorem 2 (Item parameters consistency). Under Assumptions 1 to 5 and the scaling conditions given in Theorem 1, we have the following uniform consistency result for all $j\in \left[J\right]$ and $\boldsymbol{\alpha} \in {\left\{0,1\right\}}^K:$

(36) $$\begin{align}\underset{j,\unicode{x3b1}}{\max}\kern0.1em \left|{\widehat{\unicode{x3b8}}}_{j,\unicode{x3b1}}-{\unicode{x3b8}}_{j,\unicode{x3b1}}^0\right|={o}_p\left(\frac{1}{\sqrt{I^{1-\tilde{c}}}}\right)+{o}_p\left(\frac{{\left(\log J\right)}^{\tilde{\unicode{x3b5}}/\unicode{x3b7}}}{\unicode{x3b4}_J{(J)}^{1/\unicode{x3b7}}}\right),\end{align}$$

where $\tilde{c}$ and $\tilde{\unicode{x3b5}}$ are small positive constants.

Regarding the first error term, the condition ${\unicode{x3c0}}_{\boldsymbol{\unicode{x3b1}}}>0$ for all $\boldsymbol{\unicode{x3b1}} \in {\left\{0,1\right\}}^K$ ensures that, with probability one, there are enough samples within each class to provide accurate estimates of the item parameters. Notably, the first error term (and the constant $\tilde{c}$ ) arises because the number of parameters approaches infinity jointly with the sample size $I$ , which causes a slight deviation from the optimal error rate of ${O}_p\left(1/\sqrt{I}\right)$ . The maximum deviation ${\max}_{j,\boldsymbol{\unicode{x3b1}}}\kern0.1em \left|{\widehat{\unicode{x3b8}}}_{j,\boldsymbol{\unicode{x3b1}}}-{\unicode{x3b8}}_{j,\boldsymbol{\unicode{x3b1}}}^0\right|$ is also affected by the classification error, as indicated by the second error term ${o}_p\left({\left(\log J\right)}^{\tilde{\unicode{x3b5}}/\unicode{x3b7}}/\left({\unicode{x3b4}}_J{J}^{1/\unicode{x3b7}}\right)\right)$ .

We can easily establish the consistency of the mixing parameter estimator $\hat{\boldsymbol{\unicode{x3c0}}}$ . When $h\left(\unicode{x3c0} \right)=-\log \unicode{x3c0}$ , the mixing parameters are estimated in the sample-average form ${\sum}_i\kern0.1em \mathcal{I}\left\{{\widehat{\boldsymbol{\unicode{x3b1}}}}_i=\boldsymbol{\unicode{x3b1}} \right\}/I$ , which converges in probability to ${\unicode{x3c0}}_{\boldsymbol{\unicode{x3b1}}}^0$ because of the clustering consistency.

Corollary 1 (Proportion parameters consistency). Under Assumptions 1 to 5 and the scaling conditions given in Theorem 1, when $h\left(\unicode{x3c0} \right)$ is taken as $-\log \unicode{x3c0}$ , we have ${\widehat{\unicode{x3c0}}}_{\unicode{x3b1}}\overset{P}{\to }{\unicode{x3c0}}_{\unicode{x3b1}}^0$ .

5 Simulation study

This section compares the previous NPC and GNPC methods with the corresponding GB methods that use the NPC and GNPC loss functions, referred to as GBNPC and GBGNPC, respectively. The simulation study primarily aims to assess the behavior of the GB method's parameter estimates with small numbers of examinees and items. Because these methods are based on the loss functions of the nonparametric methods, the parameters of primary interest are the attribute mastery patterns, and we mainly focus on comparing the point estimates from the four methods. To represent the uncertainty of the estimates, we also present attribute mastery probabilities for the GBGNPC and GBNPC methods, which illustrates the benefit of the proposed methods over the nonparametric methods.

The code for this simulation study is available on the Open Science Framework (OSF) webpage: https://osf.io/sau6j/.

5.1 Simulation settings

Five factors are manipulated in the simulations. All factors had two conditions; hence, ${2}^5=32$ simulation settings were used. The first factor was the data-generating model: DINA or general DCM (e.g., LCDM). The DINA model condition is a simpler data-generating situation, whereas the general DCM is more complex. The second factor was the Q-matrix: the four- and five-attribute Q-matrices are listed in Tables 1 and 2, respectively. Table 1 contains 19 items: eight simple items (i.e., measuring only one attribute), six items measuring two attributes, five items requiring three attributes, and the most complex item measuring all four attributes. Table 2 lists 30 items: eight simple items, ten items measuring two attributes, and ten items measuring three attributes.

Table 1 The four-attribute $\mathrm{Q}$ -matrix

Table 2 The five-attribute $\mathrm{Q}$ -matrix

Sample size was the third factor, with 30 or 300 participants. The sample size of 30 participants mimicked a typical classroom size, and the sample size of 300 participants was 10 times larger than the classroom setting. The fourth factor was the attribute correlation: independent $\left(\unicode{x3c1} =0\right)$ or highly correlated $\left(\unicode{x3c1} =0.8\right)$. The independent attribute condition is unrealistic but represents an ideal condition, whereas the highly correlated condition is more realistic because applications of DCMs typically indicate high correlations among attributes (e.g., von Davier, Reference von Davier2008). The fifth factor was item quality, defined by how clearly an item separates examinees who have mastered none of the required attributes from those who have mastered all of them. Under the high-item-quality condition, the correct response probability was 0.1 for the pattern mastering none of the required attributes and 0.9 for the pattern mastering all of them; under the low-item-quality condition, the corresponding probabilities were 0.3 and 0.7. The correct item response probabilities of the intermediate mastery patterns were generated following Yamaguchi and Templin (Reference Yamaguchi and Templin2022b) or Yamaguchi and Templin (Reference Yamaguchi and Templin2022a).

The data generation process used herein was similar to those in previous studies, such as Chiu and Douglas (Reference Chiu and Douglas2013), Yamaguchi and Templin (Reference Yamaguchi and Templin2022b), and Yamaguchi and Templin (Reference Yamaguchi and Templin2022a). First, for each individual, we generated a continuous latent variable vector ${\tilde{\boldsymbol{\unicode{x3b1}}}}_i={\left({\tilde{\unicode{x3b1}}}_{i1},\dots, {\tilde{\unicode{x3b1}}}_{iK}\right)}^{\top }$ from a $K$-dimensional normal distribution with zero means and a compound-symmetry covariance matrix with unit variances and a correlation of 0 or 0.8. Subsequently, the continuous latent variable vector ${\tilde{\boldsymbol{\unicode{x3b1}}}}_i$ was converted into an attribute mastery pattern. More precisely, if ${\tilde{\unicode{x3b1}}}_{ik}$ was greater than ${\Phi}^{-1}\left(k/\left(1+K\right)\right)$, then ${\unicode{x3b1}}_{ik}=1$; otherwise, ${\unicode{x3b1}}_{ik}=0$, where ${\Phi}^{-1}\left(\cdotp \right)$ is the inverse of the standard normal cumulative distribution function. The simulated item responses were randomly generated using these attribute mastery patterns, the assumed data-generating model (DINA or general DCM), and the item response probabilities. As mentioned in the previous section, the parameters of the priors in the GB method were set to ${a}_{jl}^0=2,{b}_{jl}^0=1,\forall j,l$ and ${\delta}_l^0=1,\forall l$. The step size of the Metropolis update was fixed at 0.05. A one-chain MCMC with 1,000 iterations was employed. The first 500 iterations were discarded as the burn-in period; therefore, 500 MCMC samples were used to approximate the posterior distributions.
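To make the data-generating process concrete, the following Python sketch reproduces the main steps described above for the DINA condition (multivariate normal latent variables with compound-symmetry correlation, thresholding into attributes, and Bernoulli item responses). The guessing/slipping values 0.1/0.9 correspond to the high-item-quality condition, and the small Q-matrix shown here is a hypothetical example, not the Q-matrices of Tables 1 and 2.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def generate_dina_data(I, Q, rho, p_low=0.1, p_high=0.9):
    """Simulate attribute patterns and DINA item responses (sketch of Section 5.1).

    I: number of examinees; Q: (J, K) binary Q-matrix; rho: attribute correlation.
    p_low / p_high: correct-response probabilities when at least one required attribute
    is missing / when all required attributes are mastered (0.1/0.9 = high item quality,
    0.3/0.7 = low item quality).
    """
    J, K = Q.shape
    cov = np.full((K, K), rho) + (1.0 - rho) * np.eye(K)      # compound symmetry, unit variances
    alpha_tilde = rng.multivariate_normal(np.zeros(K), cov, size=I)
    thresholds = norm.ppf(np.arange(1, K + 1) / (K + 1))       # Phi^{-1}(k / (1 + K))
    alpha = (alpha_tilde > thresholds).astype(int)             # (I, K) mastery patterns
    eta = np.all(alpha[:, None, :] >= Q[None, :, :], axis=2)   # DINA ideal responses (I, J)
    X = rng.binomial(1, np.where(eta, p_high, p_low))          # observed responses
    return alpha, X

# A hypothetical small Q-matrix used only for illustration
Q = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]])
alpha, X = generate_dina_data(I=30, Q=Q, rho=0.8)
```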

The main target parameters are the attribute mastery patterns, which are categorical latent variables. However, common MCMC convergence criteria, such as the Gelman–Rubin $\hat{R}$, are designed for continuous variables and may not be applicable to categorical variables. Therefore, performing a convergence check for categorical variables in MCMC is not easy in this context. Instead of directly checking the convergence of attribute mastery, we calculated the average correlations of the attribute mastery probabilities estimated from the first and second halves of the MCMC iterations after the burn-in period. If the attribute mastery probabilities estimated from the first half after the burn-in period are consistent with those from the later MCMC iterations, we consider the attribute mastery results to be stable.
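A minimal sketch of this stability check is given below, assuming the retained MCMC draws of the attribute indicators are stored in a three-dimensional array; the array name is a hypothetical placeholder.

```python
import numpy as np

def split_half_stability(alpha_draws):
    """Average correlation of attribute mastery probabilities estimated from the
    first and second halves of the post-burn-in draws.

    alpha_draws: (T, I, K) array of sampled attribute indicators (0/1).
    """
    T = alpha_draws.shape[0]
    first = alpha_draws[: T // 2].mean(axis=0)    # (I, K) mastery probabilities, first half
    second = alpha_draws[T // 2 :].mean(axis=0)   # (I, K) mastery probabilities, second half
    K = first.shape[1]
    corrs = [np.corrcoef(first[:, k], second[:, k])[0, 1] for k in range(K)]
    return float(np.mean(corrs))   # values near 1 suggest the chain has stabilized
```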

The attribute mastery pattern of the $i$th individual was calculated based on the posterior attribute mastery probabilities. If the posterior probability of the $k$th attribute was 0.5 or greater, the attribute was considered mastered; otherwise, it was considered not mastered. Each estimation method was evaluated using two attribute mastery recovery indices: the attribute-level agreement ratio (AAR) and the pattern-level agreement ratio (PAR). AAR and PAR were calculated as follows:

(37) $$\begin{align}{\mathrm{AAR}}_k=\frac{1}{IM}\sum \limits_{m=1}^M\kern0.20em \sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left({\hat{\unicode{x3b1}}}_{ik}^{(m)}={\unicode{x3b1}}_{ik}^{\left(\mathrm{True}\;\right)}\right),\forall k\end{align}$$
(38) $$\begin{align}\mathrm{PAR}=\frac{1}{IM}\sum \limits_{m=1}^M\kern0.20em \sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left({\hat{\boldsymbol{\unicode{x3b1}}}}_i^{(m)}={\boldsymbol{\unicode{x3b1}}}_i^{\left(\mathrm{True}\;\right)}\right),\end{align}$$

where ${\boldsymbol{\hat{\unicode{x3b1}}}}_i^{(m)}={\left({\hat{\unicode{x3b1}}}_{i1}^{(m)},{\hat{\unicode{x3b1}}}_{i2}^{(m)},\dots, {\hat{\unicode{x3b1}}}_{iK}^{(m)}\right)}^{\top }$ is the estimated attribute mastery pattern for individual $i$ in the $m$th replication, ${\boldsymbol{\unicode{x3b1}}}_i^{\left(\mathrm{True}\;\right)}={\left({\unicode{x3b1}}_{i1}^{\left(\mathrm{True}\;\right)},{\unicode{x3b1}}_{i2}^{\left(\mathrm{True}\;\right)},\dots, {\unicode{x3b1}}_{iK}^{\left(\mathrm{True}\;\right)}\right)}^{\top }$ is the true attribute vector of individual $i$, and $M=100$ is the total number of simulation replications.
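For concreteness, the following Python sketch computes the two indices for a single replication under the 0.5 cutoff described above (averaging over the $M$ replications is then straightforward); the array names are hypothetical.

```python
import numpy as np

def aar_par(prob_hat, alpha_true, cutoff=0.5):
    """AAR (per attribute) and PAR for a single replication.

    prob_hat:   (I, K) posterior attribute mastery probabilities
    alpha_true: (I, K) true binary attribute patterns
    """
    alpha_hat = (prob_hat >= cutoff).astype(int)           # threshold at 0.5
    aar = (alpha_hat == alpha_true).mean(axis=0)           # attribute-level agreement
    par = np.all(alpha_hat == alpha_true, axis=1).mean()   # pattern-level agreement
    return aar, par
```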

5.2 Results

Table 3 shows the results; all average correlations exceeded 0.98. Therefore, the MCMC iterations can be regarded as stable, and attribute mastery can be estimated from the MCMC samples after the burn-in period.

Table 3 The average correlations of attribute mastery probabilities estimated by first and second halves of MCMC iterations after the burn-in period

Note: GBGNPC, generalized Bayesian method with generalized nonparametric loss function; GBNPC, generalized Bayesian method with nonparametric loss function.

Figures 1 and 2 present the simulation results of the DINA data generation with four- and five-attribute Q-matrix conditions, respectively. In this simulation, the AARs and PARs of the two Q-matrix conditions demonstrated similar tendencies; therefore, our discussion here focuses on the four-attribute Q-matrix condition. The high-item-quality conditions presented in the four left panels of Figure 1 indicated that all four estimation methods provided high AARs and PARs. The low-item-quality conditions presented in the four right panels of Figure 1 indicated lower AARs and PARs than the high-item-quality conditions and exhibited some differences among the four estimation methods. Under the low-item-quality conditions, the attribute correlation had a more pronounced effect on the AARs and PARs. Furthermore, GBNPC and GBGNPC demonstrated higher AARs and PARs than the corresponding NPC and GNPC methods under the 30-sample size, 0.8 attribute correlation, and low-item-quality conditions. Interestingly, GBNPC exhibited the highest AARs and PARs under the 300-sample size, 0.8 attribute correlation, and low-item-quality conditions. Moreover, under these conditions, GBGNPC had similar AARs and PARs to the NPC, and the GNPC produced the worst result.

Figure 1 Simulation results of the DINA data generation with four-attribute $\mathbf{Q}$ -matrix conditions.

Figure 2 Simulation results of the DINA data generation with five-attribute $\mathbf{Q}$ -matrix conditions.

Figures 3 and 4 present the results of the general DCM data generation with four- and five-attribute Q-matrix conditions, respectively. Again, the AARs and PARs of the two Q-matrix conditions exhibited similar patterns; hereafter, we predominantly focus on results of the four-attribute Q-matrix conditions. Under high-item-quality conditions, GNPC and GBGNPC outperformed NPC and GBNPC. Furthermore, under high-attribute correlation conditions, GBGNPC was superior to GNPC; the same pattern was observed between GBNPC and NPC under the same conditions. In the low-item-quality conditions presented in the four right panels of Figure 3, GBGNPC and GBNPC tended to have higher AARs and PARs than GNPC and NPC. In particular, a sample size of 300, a high-attribute correlation, and low-item-quality conditions indicated better AARs and PARs for GBGNPC and GBNPC than GNPC or NPC.

Figure 3 Simulation results of the general DCM data generation with four-attribute Q-matrix conditions.

Figure 4 Simulation results of the general DCM data generation with five-attribute Q-matrix conditions.

We also examined the attribute mastery probabilities of the GBGNPC and GBNPC methods, which represent the uncertainty of the parameter estimates. Figures 5 and 6 present box plots of the average attribute mastery probabilities for the four- and five-attribute conditions under the DINA model-based data-generating process. Interestingly, the GBNPC method tended to show higher average attribute mastery probabilities than the GBGNPC. The differences between the GBGNPC and GBNPC methods were relatively small for the first attribute, but the discrepancy became larger as the attribute number increased. The later attributes were more difficult to master, and the number of individuals mastering them was small. These tendencies also occurred under the general data-generating process, as shown in Figures 7 and 8. These posterior probabilities of attribute mastery represent estimation uncertainty, which allows a careful examination of attribute mastery status. For example, attribute mastery probabilities near the cutoff value might indicate indeterminacy between mastery and nonmastery. Such uncertainty quantification cannot be obtained through the GNPC or NPC methods.

Figure 5 Box plots of attribute mastery probabilities of the DINA data generation with four-attribute Q-matrix conditions.

Figure 6 Box plots of attribute mastery probabilities of the DINA data generation with five-attribute Q-matrix conditions.

Figure 7 Box plots of attribute mastery probabilities of the general data generation with four-attribute $\mathbf{Q}$ -matrix conditions.

Figure 8 Box plots of attribute mastery probabilities of the general data generation with five-attribute Q-matrix conditions.

In summary, NPC and GBNPC tended to have higher AARs and PARs under DINA data generation, low-item-quality conditions, and high attribute correlations. However, GBGNPC was sometimes similar to NPC under the DINA data generation conditions, whereas GNPC performed worst. By contrast, under general DCM data generation conditions, GBGNPC and GNPC performed better than GBNPC and NPC for high-quality items, and GBGNPC and GBNPC performed better for low-quality items. Based on these results, GBGNPC appears to be the best overall choice for attribute mastery estimation. If a DINA-type item response mechanism is confirmed, GBNPC is the optimal choice among the four estimation methods from the perspective of attribute recovery.

A possible reason for the superiority of the GBGNPC over the GNPC is the prior settings. In our simulation settings, the sample sizes were relatively small for the situations in which the nonparametric methods would typically be employed. Under such conditions, estimating the weight parameters may be difficult for the GNPC, especially under the low-item-quality conditions. The GBGNPC, on the other hand, assumed priors for the weight parameters, and the priors conveyed information about the item characteristics, which helped in estimating the attribute mastery patterns. Another reason may be that the GBGNPC can account for uncertainty in parameter estimation. The GNPC plugs parameter estimates into the loss function and simply selects the attribute mastery pattern that minimizes the loss, without considering the second- or third-best attribute mastery patterns. By contrast, the GBGNPC can consider and use the second-best attribute mastery pattern when estimating attribute mastery probabilities. If these considerations are correct, the GBGNPC may remain superior even with noninformative priors. The effects of prior settings are an important topic for detailed research in future studies.

In addition, the effects of the manipulated factors are discussed here. First, the attribute correlation affected attribute mastery recovery, and the GB methods provided better recovery than the nonparametric methods. The nonparametric loss functions cannot incorporate information about the attribute correlations, whereas the generalized posteriors capture such information from the data through the mixing proportions over attribute patterns. Because attributes are generally correlated with each other, the GB methods were generally better than the nonparametric methods. We did not explicitly include a loss related to the attribute correlations, but the GB method allows us to include a loss term for the attribute correlations or the attribute structure. This may be a useful extension when constructing a loss function for the GB method.

Second, item quality also affected the attribute mastery recovery results. The GB methods showed better results than the nonparametric methods, especially under the low-item-quality conditions, where prior information may help to improve attribute recovery. That is, the GB method can utilize not only the loss function but also prior information, which the current nonparametric methods cannot do. Thus, in this simulation study, the GB methods performed at least as well as, and often better than, the nonparametric methods from the perspective of attribute mastery recovery.

6 Real data example

The real data example aimed to compare the four estimation methods used in the simulation study and examine how these estimation methods provide different attribute mastery results. This real data comparison provided an example of the behavior of the proposed GB method for DCMs.

To show the superiority of their proposed methods, Ma and Jiang (Reference Ma and Jiang2021) used $k$-fold cross-validation with the log marginal likelihood. From our understanding, the log marginal likelihood does not involve the individual-level parameters, that is, the attribute mastery patterns. In the cross-validation procedure, the model parameters estimated from a training dataset are plugged in to calculate the log marginal likelihood of the test dataset. In our context, the loss functions in the GB method and the nonparametric methods are used to estimate the attribute mastery patterns, which are individual-specific; therefore, the attribute mastery patterns estimated from a training dataset do not carry over to a test dataset. Exploring an appropriate quantitative evaluation for the GB method in DCMs is an important direction for future research.

6.1 Data analysis settings

The Examination for the Certificate of Proficiency in English (ECPE) data were selected as an example. The ECPE data have been analyzed in various previous studies, such as Templin and Hoffman (Reference Templin and Hoffman2013) and Templin and Bradshaw (Reference Templin and Bradshaw2014). The ECPE data contain responses from 2,922 examinees to 28 items. Table 4 presents the $28\times 3$ Q-matrix, which assumes three attributes: morphosyntactic rules $\left({\unicode{x3b1}}_1\right)$, cohesive rules $\left({\unicode{x3b1}}_2\right)$, and lexical rules $\left({\unicode{x3b1}}_3\right)$. The settings of the GB methods were the same as those used in the simulations, except that we employed the GNPC and NPC estimates as initial values for the GBGNPC and GBNPC, respectively. The data analysis code can be obtained from the OSF webpage https://osf.io/sau6j/.

Table 4 The $\mathrm{Q}$ -matrix of ECPE data

6.2 Results

The same correlations as in the simulation study were calculated. Again, the correlations of the three attributes with the GBNPC and GBGNPC methods were all greater than 0.99. This indicated that the MCMC iterations for attribute mastery were stable.

Table 6 lists the frequencies and ratios of the attribute mastery patterns for the four estimation methods, and several differences can be observed. First, the GBGNPC and GBNPC estimated the frequency of pattern (001) to be lower than the GNPC and NPC did. Second, the frequency of pattern (011) was highest for the GBGNPC (1,203), followed by the GNPC (955), the NPC (522), and the GBNPC (386). The GBGNPC and GBNPC also produced lower frequencies than the GNPC and NPC for patterns (100), (101), and (110). The final difference concerns pattern (111), for which the GBGNPC and GNPC had relatively smaller frequencies than the GBNPC and NPC.

Table 5 shows the means and SDs of the attribute mastery probabilities for the GBGNPC and GBNPC methods. For the first attribute (morphosyntactic rules), the mean attribute mastery probability was .551 $\left(\mathrm{SD}=.388\right)$ for the GBGNPC and .807 $\left(\mathrm{SD}=.326\right)$ for the GBNPC; this discrepancy was the largest among the three attributes. The attribute mastery probabilities for the second (cohesive rules) and third (lexical rules) attributes were higher than 0.90 for both the GBGNPC and GBNPC methods, so these attributes tended to be mastered.

Table 5 Means and SDs of posterior attribute mastery probabilities for GBGNPC and GBNPC methods

Table 6 Frequencies and ratios of the estimated attribute mastery patterns with the four estimation methods

Note: GBGNPC, generalized Bayesian method with generalized nonparametric loss function; GBNPC, generalized Bayesian method with nonparametric loss function; GNPC, generalized nonparametric method; NPC, nonparametric method.

Table 7 shows the estimated attribute mastery patterns of the GBGNPC and GNPC. A large portion of the GBGNPC pattern (011) corresponds to patterns (001) and (010) of the GNPC. Furthermore, patterns (011), (100), (101), and (110) of the GNPC correspond to pattern (111) of the GBGNPC. These results indicate that the GBGNPC tended to estimate more attributes as mastered than the GNPC.

Table 7 Contingency table of the estimated attribute mastery patterns by GBGNPC and GNPC

Note: GBGNPC, generalized Bayesian method with generalized nonparametric loss function; GNPC, generalized nonparametric method.

Table 8 presents the estimated attribute mastery patterns of the GBNPC and NPC. The results in Table 8 are similar to those in Table 7 for the GBGNPC and GNPC. For example, patterns (000), (001), and (010) with the NPC are sometimes estimated as pattern (011) by the GBNPC. Furthermore, patterns (000) to (110) with the NPC were sometimes classified as pattern (111) by the GBNPC. Therefore, the GBNPC also tended to estimate more attributes as mastered than the NPC.

Table 8 Contingency table of the estimated attribute mastery patterns by GBNPC and NPC

Note: GBNPC, generalized Bayesian method with nonparametric loss function; NPC, nonparametric method.

We also checked individual differences between the GBGNPC and GNPC methods. Table 9 shows the individuals with the largest discrepancy in estimated attribute mastery patterns between the GBGNPC and GNPC methods: the GBGNPC and the GNPC provided $\boldsymbol{\unicode{x3b1}} =\left(0,1,1\right)$ and $\boldsymbol{\unicode{x3b1}} =\left(1,0,0\right)$, respectively. The response patterns did not indicate a systematic tendency, but the sum scores of these individuals ranged from 11 to 15, which is about half the maximum sum score of 28. The maximum subscores for attributes one, two, and three were 13, 6, and 18, respectively, so the individuals in Table 9 obtained about half of the maximum possible subscores. Thus, the pattern $\left(1,0,0\right)$ might substantially underestimate the latent attributes, making the pattern $\left(0,1,1\right)$ more plausible. Furthermore, some attribute mastery probabilities were close to the cutoff value of 0.5. For example, the mastery probability of the third attribute for individual ID 813 was 0.516, the mastery probabilities of the first attribute for individuals ID 1060 and ID 2378 were 0.418 and 0.420, and the mastery probability of the third attribute for individual ID 2607 was 0.556. These values suggest that the mastery status of the corresponding attributes was not strongly supported. The posterior probabilities from the proposed GBGNPC and GBNPC methods can be used in such cases, which is not possible with the typical nonparametric methods.

Table 9 Individual differences in estimated patterns for GBGNPC and GNPC methods, response patterns, sum- and subscores, and attribute mastery probabilities

Note: GBGNPC, generalized Bayesian method with a generalized nonparametric loss function; GBNPC, generalized Bayesian method with a nonparametric loss function.

The attribute mastery probabilities provide information about the estimation uncertainty of mastery and nonmastery of an attribute for an individual. Even if we judge an individual to have mastered an attribute, it may be better to emphasize this uncertainty when the mastery probability is only slightly above the cutoff value. The nonparametric methods cannot provide such information. Therefore, it may be useful to introduce a third category representing an indeterminate state between mastery and nonmastery in DCM applications.

In addition to the individual attribute mastery probabilities, Table 10 presents the posterior attribute mastery pattern probabilities obtained with the GBGNPC and GBNPC methods. An individual's posterior attribute pattern probabilities represent the relative plausibility of the attribute mastery patterns. From Table 10, we can see that some attribute patterns showed almost the same posterior probabilities. For example, individual ID 2378 showed relatively high posterior probabilities of 0.574 and 0.406 for patterns (011) and (111), respectively, under the GBGNPC method, and a similar tendency was observed for individual ID 1060. The posterior based on the GBNPC method provided more nuanced estimates for individual ID 813, who had similar posterior probabilities of 0.282, 0.316, and 0.250 for patterns (000), (010), and (110), respectively. It may not be appropriate to provide definitive diagnostic feedback based on such unstable posterior probabilities. Previous nonparametric methods cannot provide such uncertainty information, which can be used for a careful diagnosis of attribute mastery.

Table 10 Generalized posterior probabilities of attribute mastery patterns by the GBGNPC and GBNPC methods

Note: GBGNPC, generalized Bayesian method with a generalized nonparametric loss function; GBNPC, generalized Bayesian method with a nonparametric loss function.

7 Discussions and future directions

This study extends the loss function-based estimation method proposed by Ma, de la Torre, and Xu (Reference Ma, de la Torre and Xu2023) to the GB method, which considers estimation uncertainty and prior knowledge. The proposed method provides a novel approach for estimating the parameters of DCMs: it can be used with any type of loss function, accounts for the uncertainty of parameter estimation, and relaxes the assumption of the typical Bayesian method that a likelihood function must be specified. The theoretical analysis established consistency results for the proposed GB method under mild regularity conditions. Additionally, the simulation study revealed that the GB method improved attribute mastery recovery compared with the previous nonparametric methods. The real data example indicated that the proposed GB methods with the nonparametric loss functions tended to estimate more attributes as mastered than the nonparametric methods.

The theoretical results not only guarantee the consistency of the MAP estimation results but also provide convergence rates, which are helpful in characterizing the finite-sample estimation errors. All these results are new to the literature and provide theoretical justification for using the nonparametric methods and the proposed GB approach. Moreover, the theoretical results in this paper are established for a general loss function under the proposed assumptions; they cover popular loss functions, such as the GNPC and log-likelihood loss functions used in Ma, de la Torre, and Xu (Reference Ma, de la Torre and Xu2023).

One interesting future research problem is to establish consistent results for other Bayesian estimators, such as expected a posteriori (EAP). However, this is a more challenging question as it involves deriving the limiting distribution of the Bayesian posterior distribution. Intuitively, given our theoretical results of MAP, EAP would also be consistent, but technically this is not easy to determine and needs the development of new mathematical tools. Moreover, Assumption 2 may be further relaxed to allow for some latent attribute mastery patterns that do not exist in the population. In particular, if we know which attribute mastery patterns have zero probability, such as in hierarchical DCMs, then our theoretical results would still apply. However, if this information is unknown, while some latent attribute mastery patterns have zero probability, the model itself may have some identifiability issues under the nonparametric DCM setting. This is another interesting topic for future study.

Another future research direction is to explore how to determine the learning rate from data, especially under the $\mathcal{M}$-open setting. Intuitively, the learning rate controls the relative importance of the prior information and the loss function. We can set a relatively small value for the learning rate if we have enough prior information about the attribute mastery distribution and use several new items whose nature we do not know; in this case, we put relatively great importance on the prior information rather than the observed data. However, it may not be realistic to set the learning rate greater than one. Such a high learning rate would amplify the effect of the loss function and might indicate an overreliance on the data, which may not be suitable for the $\mathcal{M}$-open setting, in which the data-generating process is unknown. Therefore, we need to explore how to determine the learning rate from data.

As mentioned previously, no scholarly agreement exists regarding how to determine the learning rate, which is an important topic for future research especially in the DCM context. In particular, data-driven learning rate determination procedures were studied in Wu and Martin (Reference Wu and Martin2023), where several selection methods such as the SafeBayes algorithm based on the cumulative log-loss (Grünwald & van Ommen, Reference Grünwald and van Ommen2017), information gain perspective (Holmes & Walker, Reference Holmes and Walker2017), modified weighted likelihood bootstrap approach (Lyddon et al., Reference Lyddon, Holmes and Walker2019), and the approximate achievement of nominal frequentist coverage probability (Syring & Martin, Reference Syring and Martin2019) were compared. However, all of these methods have different foundations, and we need to explore which one is most appropriate for the DCM context.

Another topic that requires further investigation is model data fit evaluation. From our understanding, the GB method avoids explicit model representation in the framework. Therefore, the model evaluation scheme is not included in the procedure of the GB method. This is also true for the GBGNPC method proposed in this study. Therefore, future research needs to explore what kind of statistics can be used for model data fit. In particular, previously developed methods of model data fit assessment in psychometrics and Bayesian data analysis could be employed in our setting. Following Sinharay (Reference Sinharay2006), discrepancy measures such as observed score distribution, point biserial correlation, and statistical measures of association among the item pairs could be used for posterior predictive model checking (PPMC). For further details on PPMC methods for Bayesian networks and IRT models, see also Sinharay (Reference Sinharay, Johnson and Stern2006) and Sinharay (Reference Sinharay and van der Linden2016). Moreover, PPMC for person fit (Sinharay, Reference Sinharay2015) would also provide an important measure to assess the model fit for the attribute mastery patterns at the personal level, which is often of interest in cognitive diagnosis.
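As an illustration of how such a check might look in this setting, the following minimal Python sketch computes a posterior predictive p-value for one possible discrepancy measure (the variance of the total scores, related to the observed score distribution mentioned above). The array layout and names are hypothetical placeholders, not part of the proposed method.

```python
import numpy as np

def ppmc_pvalue(X, theta_draws, z_draws, seed=0):
    """Posterior predictive check using the variance of the total scores as the
    discrepancy measure (one of the options mentioned in the text).

    X:           (I, J) observed binary responses
    theta_draws: (T, J, L) posterior draws of item parameters by local latent class
    z_draws:     (T, I, J) posterior draws of local latent class indices
    """
    rng = np.random.default_rng(seed)
    I, J = X.shape
    T = theta_draws.shape[0]
    obs_stat = X.sum(axis=1).var()
    exceed = 0
    for t in range(T):
        prob = theta_draws[t][np.arange(J), z_draws[t]]   # (I, J) success probabilities
        X_rep = rng.binomial(1, prob)                     # replicated data set
        exceed += X_rep.sum(axis=1).var() >= obs_stat
    return exceed / T   # p-values near 0 or 1 indicate potential misfit
```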

As a final note on the choice of estimation methods, estimation time must also be considered. The GB method employs an MCMC procedure, so its estimation time is longer than that of the nonparametric methods. In our simulation, the estimation times were less than ten seconds, so the computation is not prohibitively time consuming. However, if immediate feedback is needed, the time difference between the two kinds of methods may be crucial. Estimation time should therefore be considered in light of the requirements of real data analysis.

Data availability statement

The data analysis code is available in the Open Science Framework page: https://osf.io/sau6j/.

Funding statement

This work was supported by JSPS KAKENHI 20H01720, 21H00936, 22K13810, 23H00985, 23H00065, and 24K00485.

Competing interests

The authors declare no conflicts of interest.

Appendix

Proofs of Theorems 1 and 2

A.1 Preparation for the Proofs

In this appendix, we provide some basic tools and introduce helpful notations for the proofs of Theorems 1 and 2. The proofs are presented in the subsequent sections.

Motivated by the constraint (34), we introduce the concept of a “local” latent class at the item level. Considering item $j$ with q-vector ${\boldsymbol{q}}_j$ , the constraint (34) divides the collection of attribute mastery profiles $\boldsymbol{\unicode{x3b1}}$ , which is ${\left\{0,1\right\}}^K$ , based on an equivalence relationship where ${\boldsymbol{\unicode{x3b1}}}_l{\sim}_j{\boldsymbol{\unicode{x3b1}}}_{l^{\prime }}$ is defined by ${\boldsymbol{\unicode{x3b1}}}_l\circ {\boldsymbol{q}}_j={\boldsymbol{\unicode{x3b1}}}_{l^{\prime }}\circ {\boldsymbol{q}}_j$ ; here, the subscript ${\sim}_j$ emphasizes that the equivalence relationship is determined by the $j$ th item ${\boldsymbol{q}}_j$ . On this basis, we introduce a function $\unicode{x3be} :{\left\{0,1\right\}}^K\times {\left\{0,1\right\}}^K\to \mathbb{N}$ where $\unicode{x3be} \left({\boldsymbol{q}}_j,{\boldsymbol{\unicode{x3b1}}}_l\right)=\unicode{x3be} \left({\boldsymbol{q}}_j,{\boldsymbol{\unicode{x3b1}}}_{l^{\prime }}\right)$ is equivalent to ${\boldsymbol{\unicode{x3b1}}}_l\circ {\boldsymbol{q}}_j={\boldsymbol{\unicode{x3b1}}}_{l^{\prime }}\circ {\boldsymbol{q}}_j$ . This function assigns numbers to the equivalent classes induced by item $j$ based on specific rules. In the following context, we refer to $\unicode{x3be} \left({\boldsymbol{q}}_j^0,\boldsymbol{\unicode{x3b1}} \right)$ as the local latent class of $\boldsymbol{\unicode{x3b1}}$ induced by item $j$ . It is straightforward to verify that the number of local latent classes induced by item $j$ , denoted by $\left|\unicode{x3be} \left({\boldsymbol{q}}_j,{\left\{0,1\right\}}^K\right)\right|$ , is equal to ${L}_j={2}^{K_j}$ . Here, ${K}_j={\sum}_{k=1}^K\kern0.1em {q}_{jk}^0$ represents the number of latent attributes required for item $j$ ; consequently, the range of function $\unicode{x3be}$ satisfies $\unicode{x3be} \left({\boldsymbol{q}}_j,{\left\{0,1\right\}}^K\right)=\left[{L}_j\right]:= \left\{1,\dots, {L}_j\right\}$ . As the local latent classes are identified up to permutations on $\left[{L}_j\right]$ , owing to their categorical nature, the mapping rules between $\unicode{x3be} \left({\boldsymbol{q}}_j,{\left\{0,1\right\}}^K\right)$ and $\left[{L}_j\right]$ need not be completely specified in our discussion.
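The function $\unicode{x3be}$ can be computed directly from the element-wise product $\boldsymbol{\unicode{x3b1}} \circ {\boldsymbol{q}}_j$; the following Python sketch assigns each attribute profile a local latent class index for a given item. The particular integer coding is arbitrary, as noted above, and the function names are hypothetical.

```python
import numpy as np

def local_class_index(q_j, alpha):
    """Local latent class index xi(q_j, alpha) induced by item j (sketch).

    Two attribute profiles receive the same index iff alpha ∘ q_j coincides on the
    attributes required by item j. Reading the masked profile as a binary number is
    one arbitrary but valid labeling of the equivalence classes.
    """
    q_j = np.asarray(q_j)
    masked = np.asarray(alpha) * q_j            # alpha ∘ q_j
    bits = masked[np.flatnonzero(q_j)]          # keep only the required attributes
    return int(sum(int(b) << i for i, b in enumerate(bits)))

q_j = np.array([1, 0, 1])
print(local_class_index(q_j, [1, 1, 0]))  # -> 1
print(local_class_index(q_j, [1, 0, 0]))  # -> 1 (same local class)
```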

For brevity, we use the general notation $\mathbf{Z}=\left({z}_{ij}\right)$ to denote the collection of local latent classes for all items $j\in \left[J\right]$ and subjects $i\in \left[I\right]$ , where ${z}_{ij}$ represents $\unicode{x3be} \left({\boldsymbol{q}}_j^0,{\boldsymbol{\unicode{x3b1}}}_i\right)$ . Given that $\unicode{x3be} \left({\boldsymbol{q}}_j^0,{\boldsymbol{\unicode{x3b1}}}_l\right)=\unicode{x3be} \left({\boldsymbol{q}}_j^0,{\boldsymbol{\unicode{x3b1}}}_{l^{\prime }}\right)$ implies ${\unicode{x3b8}}_{j,{\boldsymbol{\unicode{x3b1}}}_l}={\unicode{x3b8}}_{j,{\boldsymbol{\unicode{x3b1}}}_{l^{\prime }}}$ by the definition of $\unicode{x3be}$ , we express ${\unicode{x3b8}}_{j,{\boldsymbol{\unicode{x3b1}}}_i}$ as ${\unicode{x3b8}}_{j,{z}_{ij}}$ to directly incorporate the constraint (34) into the loss function (31). For notational simplicity, we may write ${\unicode{x3b8}}_{j,{z}_{ij}}$ as ${\unicode{x3b8}}_{j,{z}_i}$ . Consequently, we define:

(A.1) $$\begin{align}{P}_{ij}=P\left({X}_{ij}=1\right)={\unicode{x3b8}}_{j,{z}_i^0}^0.\end{align}$$

Then, the loss function (31) can be rewritten as:

(A.2) $$\begin{align}\mathcal{L}\left(\mathcal{A},\Theta, \boldsymbol{\unicode{x3c0}} \mid X\right)=\sum \limits_{i=1}^I\kern0.20em \left(h\left({\unicode{x3c0}}_{{\boldsymbol{\unicode{x3b1}}}_i}\right)+\sum \limits_{j=1}^J\kern0.20em \ell \left({X}_{ij},{\unicode{x3b8}}_{j,{z}_i}\right)\right)+\sum \limits_{j,a}\kern0.20em \log {f}_{ja}\left({\unicode{x3b8}}_{ja}\right)+\sum \limits_{\boldsymbol{\unicode{x3b1}}}\kern0.20em \log {g}_{\boldsymbol{\unicode{x3b1}}}\left({\unicode{x3c0}}_{\boldsymbol{\unicode{x3b1}}}\right),\end{align}$$

where $a\in \left[{L}_j\right]$. Observing that ${X}_{ij}^2={X}_{ij}$ and $\mathbb{E}\left[{X}_{ij}\right]={P}_{ij}$, we denote the expectation of the above $\mathcal{L}\left(\mathcal{A},\Theta, \boldsymbol{\unicode{x3c0}} \mid X\right)$ by $\overline{\mathcal{L}}\left(\mathcal{A},\Theta, \boldsymbol{\unicode{x3c0}} \right):= \mathbb{E}\left[\mathcal{L}\left(\mathcal{A},\Theta, \boldsymbol{\unicode{x3c0}} \mid X\right)\right]$.
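For readers who prefer a computational view, the sketch below evaluates a version of the total loss in (A.2) with the log-likelihood element-wise loss $\ell \left(x,\unicode{x3b8} \right)=-x\log \unicode{x3b8} -\left(1-x\right)\log \left(1-\unicode{x3b8} \right)$ and $h\left(\unicode{x3c0} \right)=-\log \unicode{x3c0}$, dropping the prior-density terms (flat priors). It is an illustration of the structure of the objective only, not the exact loss used in the paper.

```python
import numpy as np

def total_loss(X, z, theta, pi_alpha, alpha_index):
    """Structure of the loss in (A.2) under the log-likelihood element-wise loss
    and h(pi) = -log(pi), with the prior-density terms dropped (flat priors).

    X:           (I, J) binary responses
    z:           (I, J) local latent class indices z_ij
    theta:       dict mapping (j, a) -> item parameter theta_{j,a}
    pi_alpha:    (L,) mixing proportions over attribute patterns
    alpha_index: (I,) index of each examinee's attribute pattern into pi_alpha
    """
    I, J = X.shape
    loss = -np.sum(np.log(pi_alpha[alpha_index]))            # sum_i h(pi_{alpha_i})
    for i in range(I):
        for j in range(J):
            th = theta[(j, int(z[i, j]))]
            loss -= X[i, j] * np.log(th) + (1 - X[i, j]) * np.log(1.0 - th)
    return loss
```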

Notably, $\mathbf{Z}=\left({z}_{ij}\right)$ is determined only by $\mathcal{A}$ because ${\mathbf{Q}}^0$ is known. In the subsequent context, quantities determined by the latent attribute profiles $\mathcal{A}$ are sometimes denoted with the superscript $\mathcal{A}$ to emphasize their relationship with $\mathcal{A}$. For an arbitrary $\mathcal{A}$, we define:

(A.3) $$\begin{align}\mathcal{L}\left(\mathcal{A}\right)=\underset{\Theta, \unicode{x3c0}}{\operatorname{inf}}\kern0.1em \mathcal{L}\left(\mathcal{A},\Theta, \boldsymbol{\unicode{x3c0}} \mid X\right)=\mathcal{L}\left(\mathcal{A},{\widehat{\Theta}}^{\left(\mathcal{A}\right)},{\widehat{\boldsymbol{\unicode{x3c0}}}}^{\left(\mathcal{A}\right)}\mid X\right),\end{align}$$
(A.4) $$\begin{align}\overline{\mathcal{L}}\left(\mathcal{A}\right)=\overline{\mathcal{L}}\left(\mathcal{A},{\overset{\leftharpoonup }{\Theta}}^{\left(\mathcal{A}\right)},{\widehat{\boldsymbol{\unicode{x3c0}}}}^{\left(\mathcal{A}\right)}\right),\end{align}$$

where $\left({\widehat{\Theta}}^{\left(\mathcal{A}\right)},{\widehat{\boldsymbol{\unicode{x3c0}}}}^{\left(\mathcal{A}\right)}\right):= \arg {\min}_{\Theta, \unicode{x3c0}}\kern0.1em \mathcal{L}\left(\mathcal{A},\Theta, \boldsymbol{\unicode{x3c0}} \mid X\right)$ and the definition of ${\overset{\leftharpoonup }{\Theta}}^{\left(\mathcal{A}\right)}$ is provided later. Notably, $\left({\overset{\leftharpoonup }{\Theta}}^{\left(\mathcal{A}\right)},{\widehat{\boldsymbol{\unicode{x3c0}}}}^{\left(\mathcal{A}\right)}\right)$ may not minimize $\overline{\mathcal{L}}\left(\mathcal{A},\Theta, \boldsymbol{\unicode{x3c0}} \right)$ for a given $\mathcal{A}$ . Then, under any realization of $\mathcal{A}$ , if the prior distribution of $\Theta$ is uniform, the following equations hold for any local latent class $a\in \left[{L}_j\right]$ :

(A.5) $$\begin{align}{\widehat{\unicode{x3b8}}}_{ja}^{\left(\mathcal{A}\right)}=\frac{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{z}_{ij}^{\left(\mathcal{A}\right)}=a\right\}{X}_{ij}}{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{z}_{ij}^{\left(\mathcal{A}\right)}=a\right\}},{\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}^{\left(\mathcal{A}\right)}=\frac{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{z}_{ij}^{\left(\mathcal{A}\right)}=a\right\}{P}_{ij}}{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{z}_{ij}^{\left(\mathcal{A}\right)}=a\right\}}.\end{align}$$

To derive (A.5), note the sum ${\sum}_{j=1}^J\kern0.1em {\sum}_{i=1}^I\kern0.1em \ell \left({X}_{ij},{\unicode{x3b8}}_{j,{z}_i}\right)$ equals the sum ${\sum}_{j=1}^J\kern0.1em {\sum}_{a=1}^{L_j}\kern0.1em {\sum}_{z_i=a}\kern0.1em \ell \left({X}_{ij},{\unicode{x3b8}}_{ja}\right)$ . When estimating ${\widehat{\unicode{x3b8}}}_{ja}$ , we focus on minimizing ${\sum}_{z_i=a}\kern0.1em \ell \left({X}_{ij},{\unicode{x3b8}}_{ja}\right)$ . By substituting $\mathbb{E}\left[{X}_{ij}\right]={P}_{ij}$ into (A.5), we find that $\mathbb{E}\left[{\hat{\unicode{x3b8}}}_{ja}\right]={\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}$ holds for any $\left(j,a\right)$ . In the following section, we use the second formula in (A.5) to define ${\overset{\leftharpoonup }{\Theta}}^{\left(\mathcal{A}\right)}$ given in (A.4). When the prior distribution is not a uniform distribution, ${\widehat{\unicode{x3b8}}}_{ja}$ is obtained from minimizing ${\sum}_{z_i=a}\kern0.1em \ell \left({X}_{ij},{\unicode{x3b8}}_{ja}\right)+\log {f}_{ja}\left({\unicode{x3b8}}_{ja}\right)$ , where ${f}_{ja}\left({\unicode{x3b8}}_{ja}\right)$ is the prior density of ${\unicode{x3b8}}_{ja}$ . To avoid ambiguity, we denote ${\hat{\unicode{x3b8}}}_{ja}:= \arg {\min}_{\unicode{x3b8}}\kern0.1em {\sum}_{z_i=a}\kern0.1em \ell \left({X}_{ij},{\unicode{x3b8}}_{ja}\right)+\log {f}_{ja}\left({\unicode{x3b8}}_{ja}\right)$ , and ${\tilde{\unicode{x3b8}}}_{ja}:= \arg {\min}_{\unicode{x3b8}}\kern0.1em {\sum}_{z_i=a}\kern0.1em \ell \left({X}_{ij},{\unicode{x3b8}}_{ja}\right)$ . It is clear that $\mathbb{E}\left[{\tilde{\unicode{x3b8}}}_{ja}\right]={\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}$ .
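Under a uniform prior, the estimators in (A.5) are simply within-class averages; a minimal Python sketch for a single item $j$ is given below, where $\overset{\leftharpoonup }{\unicode{x3b8}}$ is obtained by replacing the responses with their success probabilities ${P}_{ij}$. The function and argument names are hypothetical.

```python
import numpy as np

def item_estimates(x_j, p_j, z_j, n_classes):
    """Within-class averages in (A.5) for a single item j under a flat prior.

    x_j: (I,) observed responses X_ij;  p_j: (I,) success probabilities P_ij;
    z_j: (I,) local latent class assignments coded 0, ..., n_classes-1.
    """
    theta_hat = np.zeros(n_classes)   # tilde{theta}_{ja}: average of the responses
    theta_bar = np.zeros(n_classes)   # leftharpoon theta_{ja}: average of the P_ij
    for a in range(n_classes):
        members = (z_j == a)
        if members.any():
            theta_hat[a] = x_j[members].mean()
            theta_bar[a] = p_j[members].mean()
    return theta_hat, theta_bar
```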

Before discussing the details of our proof, we provide technical remarks to simplify the discussion. Notably, although we assume that the latent attributes have proportion parameters ${\boldsymbol{\unicode{x3c0}}}^0$, they are still treated as unknown but fixed parameters that need to be estimated. As all the proportion parameters ${\unicode{x3c0}}_{\boldsymbol{\unicode{x3b1}}}^0$ are strictly greater than zero, with probability converging to 1 there exists ${\unicode{x3b5}}_1>0$ such that ${\min}_{\boldsymbol{\unicode{x3b1}}}\kern0.1em {\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{\boldsymbol{\unicode{x3b1}}}_i^0=\boldsymbol{\unicode{x3b1}} \right\}\ge I{\unicode{x3b5}}_1$. Subsequently, we use this fact interchangeably with the first condition in Assumption 2.

The second point concerns the compact parameter space specified in Assumptions 2 and 3. Some loss functions may exhibit unusual behavior near the boundary of the parameter space. Although Assumption 2 confines the true item parameters to a compact subset within $\left(0,1\right)$ , the estimated item responses can still approach zero or one, making theoretical analysis more difficult. For any pair $\left(j,a\right),{\unicode{x3b8}}_{ja}$ lies within $\left[{\unicode{x3b4}}_2,1-{\unicode{x3b4}}_2\right]$ . We add a condition to $\hat{\mathcal{A}}$ , stating that there exists an ${\unicode{x3b5}}_2>0$ such that for each $\boldsymbol{\unicode{x3b1}}$ , the sum ${\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{\widehat{\boldsymbol{\unicode{x3b1}}}}_i=\boldsymbol{\unicode{x3b1}} \right\}$ is at least $I{\unicode{x3b5}}_2$ . With a probability approaching one, this constraint is satisfied by the true latent attribute mastery patterns ${\mathcal{A}}^0$ . With this constraint, for any pair $\left(j,a\right)$ , the probability that $\left|{\tilde{\unicode{x3b8}}}_{ja}-{\unicode{x3b8}}_{ja}\right|$ exceeds $t$ can be bounded by $2\ \exp \left(-I{\unicode{x3b5}}_2{t}^2\right)$ using Hoeffding’s inequality. Thus, the probability that ${\max}_{j,a}\kern0.1em \left|{\tilde{\unicode{x3b8}}}_{ja}-{\unicode{x3b8}}_{ja}\right|$ exceeds $t$ is less than $J{2}^{K+1}\exp \left(-I{\unicode{x3b5}}_2{t}^2\right)$ . Based on the scaling condition in Theorem 1, ${\max}_{j,a}\kern0.1em \left|{\tilde{\unicode{x3b8}}}_{ja}-{\unicode{x3b8}}_{ja}\right|={o}_p(1)$ , implying that with probability converging to 1, all the ${\tilde{\unicode{x3b8}}}_{ja}$ values fall within $\left[{\unicode{x3b4}}_2/2,1-{\unicode{x3b4}}_2/2\right]$ . Based on this result, we assume that in the later content, the estimators $\left(\widehat{\mathcal{A}},\widehat{\Theta}\right)$ are obtained by minimizing the total loss (A.2), under the constraints that ${\min}_{\boldsymbol{\unicode{x3b1}}}\kern0.1em {\sum}_i\kern0.1em \mathcal{I}\left\{{\widehat{\boldsymbol{\unicode{x3b1}}}}_i=\boldsymbol{\unicode{x3b1}} \right\}\ge I{\unicode{x3b5}}_2$ and $\widehat{\Theta}\subset \left[{\unicode{x3b4}}_3,1-{\unicode{x3b4}}_3\right]$ , for two small positive constants ${\unicode{x3b5}}_2,{\unicode{x3b4}}_3>0$ .
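For completeness, the union-bound step behind the last claim can be written out explicitly; the display below simply combines the Hoeffding bound stated above over all pairs $\left(j,a\right)$, of which there are at most $J\cdotp {2}^K$:

$$\begin{align*}P\left(\underset{j,a}{\max}\kern0.1em \left|{\tilde{\unicode{x3b8}}}_{ja}-{\unicode{x3b8}}_{ja}\right|>t\right)\le \sum \limits_{j=1}^J\kern0.20em \sum \limits_{a=1}^{L_j}\kern0.20em P\left(\left|{\tilde{\unicode{x3b8}}}_{ja}-{\unicode{x3b8}}_{ja}\right|>t\right)\le J\cdotp {2}^K\cdotp 2\exp \left(-I{\unicode{x3b5}}_2{t}^2\right)=J{2}^{K+1}\exp \left(-I{\unicode{x3b5}}_2{t}^2\right).\end{align*}$$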

The third comment concerns how to quantify the effect of the prior density ${f}_{ja}$ on the corresponding estimator ${\widehat{\unicode{x3b8}}}_{ja}$. Under the smoothness and shape constraints given in Assumption 5 and Assumption 3, the additional term $\log {f}_{ja}\left({\unicode{x3b8}}_{ja}\right)$ might cause the estimator ${\hat{\unicode{x3b8}}}_{ja}$ to have an ${O}_p\left(1/\sqrt{I}\right)$-level drift from the sample average form ${\tilde{\unicode{x3b8}}}_{ja}$ given in (A.5). By considering the Taylor expansion formula, we have:

$$\begin{align*}\log {f}_{ja}\left({\unicode{x3b8}}_{ja}\right)+\sum \limits_{z_{ij}=a}\kern0.20em \ell \left({X}_{ij},{\unicode{x3b8}}_{ja}\right)&=\log {f}_{ja}\left({\unicode{x3b8}}_{ja}\right)+\sum \limits_{z_{ij}=a}\kern0.20em \ell \left({X}_{ij},{\tilde{\unicode{x3b8}}}_{ja}\right)+\left(\sum \limits_{z_{ij}=a}\kern0.20em {\partial}_{\unicode{x3b8}}\ell \left({X}_{ij},{\tilde{\unicode{x3b8}}}_{ja}\right)\right)\left({\unicode{x3b8}}_{ja}-{\tilde{\unicode{x3b8}}}_{ja}\right)\\&\quad +\frac{1}{2}\sum \limits_{z_{ij}=a}\kern0.20em {\partial}_{\unicode{x3b8}^2}\ell \left({X}_{ij},{\grave{\unicode{x3b8}}}_{ja}\right){\left({\unicode{x3b8}}_{ja}-{\tilde{\unicode{x3b8}}}_{ja}\right)}^2\\ &=\log {f}_{ja}\left({\unicode{x3b8}}_{ja}\right)+\sum \limits_{z_{ij}=a}\kern0.20em \ell \left({X}_{ij},{\tilde{\unicode{x3b8}}}_{ja}\right) +\frac{1}{2}\sum \limits_{z_{ij}=a}\kern0.20em {\partial}_{\unicode{x3b8}^2}\ell \left({X}_{ij},{\grave{\unicode{x3b8}}}_{ja}\right){\left({\unicode{x3b8}}_{ja}-{\tilde{\unicode{x3b8}}}_{ja}\right)}^2,\end{align*}$$

where ${\grave{\unicode{x3b8}}}_{ja}$ is between ${\tilde{\unicode{x3b8}}}_{ja}$ and ${\unicode{x3b8}}_{ja}$ according to the mean value theorem; the second equality holds owing to Assumption 3. According to the above equation, we can find that ${\widehat{\unicode{x3b8}}}_{ja}=\arg {\min}_{\unicode{x3b8} \in \left[{\unicode{x3b4}}_3,1-{\unicode{x3b4}}_3\right]}\kern0.1em \log {f}_{ja}\left(\unicode{x3b8} \right)+{\sum}_{z_{ij}=a}\kern0.1em {\partial}_{\unicode{x3b8}^2}\ell \left({X}_{ij},{\grave{\unicode{x3b8}}}_{ja}\right){\left(\unicode{x3b8} -{\tilde{\unicode{x3b8}}}_{ja}\right)}^2/2$ . Note that if we take $\unicode{x3b8} ={\tilde{\unicode{x3b8}}}_{ja},\log {f}_{ja}\left(\unicode{x3b8} \right)+{\sum}_{z_{ij}=a}\kern0.1em {\partial}_{\unicode{x3b8}^2}\ell \left({X}_{ij},{\grave{\unicode{x3b8}}}_{ja}\right){\left(\unicode{x3b8} -{\tilde{\unicode{x3b8}}}_{ja}\right)}^2/2=\log {f}_{ja}\left({\tilde{\unicode{x3b8}}}_{ja}\right)$ . Based on Assumption 5, there exists a constant $C>0$ such that $\left|\log {f}_{ja}\left({\tilde{\unicode{x3b8}}}_{ja}\right)\right|\le {\sup}_{\unicode{x3b8} \in \left[{\unicode{x3b4}}_3,1-{\unicode{x3b4}}_3\right]}\kern0.1em \left|\log {f}_{ja}\left(\unicode{x3b8} \right)\right|<C$ , implying the following:

$$\begin{align*}\displaystyle 2C& \ge \frac{1}{2}\sum \limits_{z_{ij}=a}\kern0.20em {\partial}_{\unicode{x3b8}^2}\ell \left({X}_{ij},{\grave{\unicode{x3b8}}}_{ja}\right){\left({\widehat{\unicode{x3b8}}}_{ja}-{\tilde{\unicode{x3b8}}}_{ja}\right)}^2\\ {}& \ge \frac{b_L}{2}\sum \limits_{z_{ij}=a}{\left({\widehat{\unicode{x3b8}}}_{ja}-{\tilde{\unicode{x3b8}}}_{ja}\right)}^2=\frac{b_L{i}_{ja}}{2}{\left({\widehat{\unicode{x3b8}}}_{ja}-{\tilde{\unicode{x3b8}}}_{ja}\right)}^2,\end{align*}$$

where ${i}_{ja}:= {\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{z}_{ij}^{\left(\mathcal{A}\right)}=a\right\}$. Thus, a constant $\tilde{C}>0$ (for instance, $\tilde{C}=4C/{b}_L$) exists such that for any pair $\left(j,a\right)$, we have:

$$\begin{align*}{\left({\widehat{\unicode{x3b8}}}_{ja}-{\tilde{\unicode{x3b8}}}_{ja}\right)}^2\le \tilde{C}{i}_{ja}^{-1}.\end{align*}$$

This inequality will be used several times afterwards. In the theoretical analysis of the estimators above, uniform bounds on quantities involving the element-wise loss $\ell \left(\cdotp, \cdotp \right)$ and the prior densities ${f}_{ja}$ are frequently used. The existence of these uniform bounds requires restricting the parameter space of the item response parameters to a compact subspace. Therefore, it is necessary to work with a compact parameter subspace of the item parameters.

A.2 Outline of the first half of the proof

Step 1: Express the upper bound of $\mid \overline{\mathcal{L}}\left(\mathcal{A}\right)-\mathcal{L}\left(\mathcal{A}\right)\mid$ in terms of $\left(b/2\right)\cdotp \left({\sum}_{j=1}^J\kern0.1em {\sum}_{a=1}^{L_j}\kern0.1em {i}_{ja}{\left({\tilde{\unicode{x3b8}}}_{ja}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}\right)}^2\right)+\mid \mathbb{E}\left[Y\right]-Y\mid +{O}_p(J)$, where $Y:= {\sum}_i\kern0.1em {\sum}_j\kern0.1em \ell \left({X}_{ij},{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}^{\left(\mathcal{A}\right)}\right)$ is a random variable depending on $\mathcal{A}$ and ${\overset{\leftharpoonup }{\Theta}}^{\left(\mathcal{A}\right)}$, and $b$ is the upper bound of the second-order derivative of $\ell \left(\cdotp, \cdotp \right)$.

Step 2: Bound ${\sum}_j\kern0.1em {\sum}_a\kern0.1em {i}_{ja}{\left({\hat{\unicode{x3b8}}}_{ja}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}\right)}^2$ and $\mid Y-\mathbb{E}\left[Y\right]\mid$ separately to obtain a uniform convergence rate ${\sup}_{\mathcal{A}}\kern0.1em \mid \overline{\mathcal{L}}\left(\mathcal{A}\right)-\mathcal{L}\left(\mathcal{A}\right)\mid ={o}_p\left({\unicode{x3b4}}_{IJ}\right)$.

Step 3: Based on the definition of $\hat{\mathcal{A}}$ , it follows that $0\le \overline{\mathcal{L}}\left(\widehat{\mathcal{A}}\right)-\overline{\mathcal{L}}\left({\mathcal{A}}^0\right)\le 2{\sup}_{\mathcal{A}}\kern0.1em \mid \overline{\mathcal{L}}\left(\mathcal{A}\right)-\mathcal{L}\left(\mathcal{A}\right)\mid ={o}_p\left({\unicode{x3b4}}_{IJ}\right)$ , which controls the deviation $\overline{\mathcal{L}}\left(\widehat{\mathcal{A}}\right)-\overline{\mathcal{L}}\left({\mathcal{A}}^0\right)$ .

In some classical statistical inference contexts, consistency results for the parameters of interest are typically established through the uniform convergence of random functions associated with these parameters. For instance, if ${\sup}_{\boldsymbol{\unicode{x3b8}} \in \Theta}\kern0.1em \mid \widehat{\ell}\left(\boldsymbol{\unicode{x3b8}} \right)-\ell \left(\boldsymbol{\unicode{x3b8}} \right)\mid \overset{P}{\to }0$ and $\ell$ has a unique minimum $\tilde{\boldsymbol{\unicode{x3b8}}}$ on $\Theta$, then $\arg {\min}_{\Theta}\kern0.1em \widehat{\ell}\left(\boldsymbol{\unicode{x3b8}} \right)=:\widehat{\boldsymbol{\unicode{x3b8}}}\overset{P}{\to}\tilde{\boldsymbol{\unicode{x3b8}}}$ under some regularity conditions, although the regularity conditions may vary across settings. Considering $\mathcal{A}$ as the parameter to be estimated, the primary aim of the first three steps is to establish a uniform convergence result for the random loss function of $\mathcal{A}$ and thereby to control the deviation of the expected loss at $\widehat{\mathcal{A}}$ from that at ${\mathcal{A}}^0$.

A.3 Outline of the second half of the proof

Step 4: Define ${I}_{a,b}^j={\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{z}_{ij}^0=a\right\}\mathcal{I}\left\{{\widehat{z}}_{ij}=b\right\}$ for $a,b\in \left[{L}_j\right]$, which counts the samples whose true local latent class for item $j$ is $a$ and whose estimated local latent class is $b$; the off-diagonal counts correspond to wrong local latent class assignments. Derive upper bounds for quantities based on ${I}_{a,b}^j$ using $\overline{\mathcal{L}}\left(\widehat{\mathcal{A}}\right)-\overline{\mathcal{L}}\left({\mathcal{A}}^0\right)$ with the help of the identification assumptions.

Step 5: Bound ${\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{\widehat{\boldsymbol{\unicode{x3b1}}}}_i\ne {\boldsymbol{\unicode{x3b1}}}_i^0\right\}$ using the quantities based on ${I}_{a,b}^j$ with the help of the discrete structure of the Q-matrix, and then obtain the desired classification error rate.

Assumptions 1–5 are the regularity conditions for achieving clustering consistency based on the uniform convergence results established in the first half of the proof. We have provided further details regarding the assumptions in later proofs.

A.4 First Half of the Proof of Theorem 1

Step 1. The idea of decomposing $\overline{\mathcal{L}}\left(\mathcal{A}\right)-\mathcal{L}\left(\mathcal{A}\right)$ is to consider:

$$\begin{align*}\ell \left({X}_{ij},{\hat{\unicode{x3b8}}}_{j,{z}_i}\right)&-\mathbb{E}\left[\ell \left({X}_{ij},{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}\right)\right]=\left(\ell \left({X}_{ij},{\widehat{\unicode{x3b8}}}_{j,{z}_i}\right)-\ell \left({X}_{ij},{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}\right)\right)\\&\ +\left(\ell \left({X}_{ij},{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}\right)-\mathbb{E}\left[\ell \left({X}_{ij},{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}\right)\right]\right).\end{align*}$$

The variability in the first term of the right-hand side mainly emerges from the fluctuation in $\left|{\widehat{\unicode{x3b8}}}_{ja}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}\right|$ , while the randomness in the second term is attributable to the stochastic nature of ${X}_{ij}$ .

Lemma 1. Let $\left({X}_{ij};1\le i\le I,1\le j\le J\right)$ denote independent Bernoulli trials with parameters $\left({P}_{ij};1\le i\le I,1\le j\le J\right)$ . In a general latent class model, given arbitrary latent attribute mastery patterns $\mathcal{A}$ ,

(A.6) $$\begin{align}\left|\overline{\mathcal{L}}\left(\mathcal{A}\right)-\mathcal{L}\left(\mathcal{A}\right)\right|\le \frac{b}{2}\cdotp \left(\sum\limits_{j=1}^J\sum\limits_{a=1}^{L_j}{i}_{ja}{\left({\widehat{\unicode{x3b8}}}_{ja}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}\right)}^2\right)+\left|Y-\mathbb{E}(Y)\right|+{O}_p(J),\end{align}$$

where $Y={\sum}_{j=1}^J\kern0.1em {\sum}_{i=1}^I\kern0.1em \ell \left({X}_{ij},{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}\right)$ is a random variable depending on $\mathcal{A}$ and ${L}_j$ denotes the number of the distinct local latent classes induced by ${\boldsymbol{q}}_j$ for item $j$ .

Proof. By noting the decomposition that we mentioned at the beginning of Step 1, $\mid Y-\mathbb{E}\left[Y\right]\mid$ is easy to check. It is sufficient for us to prove that

$$\begin{align*}0\le \sum \limits_i\kern0.1em \sum \limits_j\kern0.1em \left(\ell \left({X}_{ij},{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}\right)-\ell \left({X}_{ij},{\widehat{\unicode{x3b8}}}_{j,{z}_i}\right)\right)\le \frac{b}{2}\cdotp \left(\sum \limits_{j=1}^J\kern0.20em \sum \limits_{a=1}^{L_j}\kern0.20em {i}_{ja}{\left({\tilde{\unicode{x3b8}}}_{ja}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}\right)}^2\right)+{O}_p(J).\end{align*}$$

The first inequality is clear by the definition of ${\tilde {\unicode{x3b8}}}_{j,{z}_i}$ (minimizing the loss). For the second part, using the mean value theorem for second-order derivatives, we obtain

(A.7) $$\begin{align}\displaystyle &\;\sum \limits_i\kern0.20em \sum \limits_j\kern0.20em \left(\ell \left({X}_{ij},{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}\right)-\ell \left({X}_{ij},{\tilde{\unicode{x3b8}}}_{j,{z}_i}\right)\right)\nonumber\\ &\quad{}=\;\sum \limits_{j=1}^J\kern0.20em \sum \limits_{a=1}^{L_j}\kern0.20em \sum \limits_{z_i=a}\kern0.20em \left(\ell \left({X}_{ij},{\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}\right)-\ell \left({X}_{ij},{\tilde{\unicode{x3b8}}}_{ja}\right)\right)\nonumber\\ &\quad{}=\;\sum \limits_{j=1}^J\kern0.20em \sum \limits_{a=1}^{L_j}\kern0.20em \sum \limits_{z_i=a}\kern0.20em \left({\partial}_{\unicode{x3b8}}\ell \left({X}_{ij},{\tilde{\unicode{x3b8}}}_{ja}\right)\left({\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}-{\tilde{\unicode{x3b8}}}_{ja}\right)+\frac{1}{2}{\partial}_{\unicode{x3b8}^2}\ell \left({X}_{ij},{\grave{\unicode{x3b8}}}_{ja}\right){\left({\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}-{\tilde{\unicode{x3b8}}}_{ja}\right)}^2\right)\\ &\quad{}=\;\sum \limits_{j=1}^J\kern0.20em \sum \limits_{a=1}^{L_j}\kern0.20em \sum \limits_{z_i=a}\kern0.20em \left(\frac{1}{2}{\partial}_{\unicode{x3b8}^2}\ell \left({X}_{ij},{\grave{\unicode{x3b8}}}_{ja}\right){\left({\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}-{\tilde{\unicode{x3b8}}}_{ja}\right)}^2\right)\nonumber\\ &\quad{}\le \;\sum \limits_{j=1}^J\kern0.20em \sum \limits_{a=1}^{L_j}\kern0.20em \sum \limits_{z_i=a}\kern0.20em \left(\frac{b_U}{2}{\left({\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}-{\tilde{\unicode{x3b8}}}_{ja}\right)}^2\right)=\frac{b_U}{2}\left(\sum \limits_{j=1}^{J} \sum \limits_{a=1}^{L_j} {i}_{ja}{\left({\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}-{\tilde{\unicode{x3b8}}}_{ja}\right)}^2\right),\nonumber\end{align}$$

where ${\grave{\unicode{x3b8}}}_{ja}$ is between ${\tilde{\unicode{x3b8}}}_{ja}$ and ${\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}$ according to the mean value theorem. The third equality holds true since by Assumption 3, we have

$$\begin{align*}\sum \limits_{z_i=a}\kern0.1em {\partial}_{\unicode{x3b8}}\ell \left({X}_{ij},{\tilde{\unicode{x3b8}}}_{ja}\right)=0. \end{align*}$$

Similarly, using ${\left({\widehat{\unicode{x3b8}}}_{ja}-{\tilde{\unicode{x3b8}}}_{ja}\right)}^2\le \tilde{C}{i}_{ja}^{-1}$ , we have:

$$\begin{align*}\displaystyle &\;\sum \limits_i\kern0.20em \sum \limits_j\kern0.20em \left(\ell \left({X}_{ij},{\widehat{\unicode{x3b8}}}_{j,{z}_i}\right)-\ell \left({X}_{ij},{\tilde{\unicode{x3b8}}}_{j,{z}_i}\right)\right)\\ {}&\quad =\sum \limits_{j=1}^J\kern0.20em \sum \limits_{a=1}^{L_j}\kern0.20em \sum \limits_{z_i=a}\kern0.20em \left(\ell \left({X}_{ij},{\widehat{\unicode{x3b8}}}_{ja}\right)-\ell \left({X}_{ij},{\tilde{\unicode{x3b8}}}_{ja}\right)\right)\\ {}&\quad =\sum \limits_{j=1}^J\kern0.20em \sum \limits_{a=1}^{L_j}\kern0.20em \sum \limits_{z_i=a}\kern0.20em \left({\partial}_{\unicode{x3b8}}\ell \left({X}_{ij},{\tilde{\unicode{x3b8}}}_{ja}\right)\left({\widehat{\unicode{x3b8}}}_{ja}-{\tilde{\unicode{x3b8}}}_{ja}\right)+\frac{1}{2}{\partial}_{\unicode{x3b8}^2}\ell \left({X}_{ij},{\grave{\unicode{x3b8}}}_{ja}\right){\left({\tilde{\unicode{x3b8}}}_{ja}-{\widehat{\unicode{x3b8}}}_{ja}\right)}^2\right)\\ {}&\quad =\sum \limits_{j=1}^J\kern0.20em \sum \limits_{a=1}^{L_j}\kern0.20em \sum \limits_{z_i=a}\kern0.20em \left(\frac{1}{2}{\partial}_{\unicode{x3b8}^2}\ell \left({X}_{ij},{\grave{\unicode{x3b8}}}_{ja}\right){\left({\widehat{\unicode{x3b8}}}_{ja}-{\tilde{\unicode{x3b8}}}_{ja}\right)}^2\right)\\ {}&\quad \le \sum \limits_{j=1}^J\kern0.20em \sum \limits_{a=1}^{L_j}\kern0.20em \sum \limits_{z_i=a}\kern0.20em \left(\frac{b_U}{2}{\left({\widehat{\unicode{x3b8}}}_{ja}-{\tilde{\unicode{x3b8}}}_{ja}\right)}^2\right)\le \sum \limits_{j=1}^J\kern0.20em \sum \limits_{a=1}^{L_j}\kern0.20em \frac{b_U\tilde{C}}{2}\le \left({b}_U\tilde{C}{2}^K\right)J,\end{align*}$$

which concludes the proof of this lemma.

Lemma 2. The following event occurs with probability at least $1-\unicode{x3b4}$ :

$$\begin{align*}\underset{\mathcal{A}}{\max}\kern0.1em \left\{\sum \limits_{j=1}^J\kern0.20em \sum \limits_{a=1}^{L_j}\kern0.20em {i}_{ja}{\left({\tilde{\unicode{x3b8}}}_{ja}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}\right)}^2\right\}<\frac{1}{2}\left(I\log {2}^K+J{2}^K\log \left(\frac{I}{2^K}+1\right)-\log \unicode{x3b4} \right).\end{align*}$$

Proof. Under any realization of $\mathcal{A}$, each ${\tilde{\unicode{x3b8}}}_{ja}$ is the average of the ${i}_{ja}$ independent Bernoulli responses $\left\{{X}_{ij}:{z}_{ij}=a\right\}$, and its expectation equals ${\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}$. By applying the Hoeffding inequality, we have

(A.8) $$\begin{align}P\left({\tilde{\unicode{x3b8}}}_{ja}\ge {\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}+t\right)\le \exp \left(-2{i}_{ja}{t}^2\right),P\left({\tilde{\unicode{x3b8}}}_{ja}\le {\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}-t\right)\le \exp \left(-2{i}_{ja}{t}^2\right).\end{align}$$
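As a quick numerical illustration of (A.8) (not part of the proof), the following minimal sketch compares the empirical upper-tail probability of a within-class sample mean with the corresponding Hoeffding bound. The group size, mean, and threshold are illustrative values chosen only for this example, and the responses are taken to be i.i.d. Bernoulli for simplicity.

```python
import numpy as np

# Monte Carlo check of the Hoeffding bound in (A.8):
#   P(theta_tilde >= theta_bar + t) <= exp(-2 * i_ja * t**2).
# The group size i_ja, the mean theta_bar, and the threshold t are
# illustrative values only; the responses are i.i.d. Bernoulli here.
rng = np.random.default_rng(seed=1)
i_ja, theta_bar, t = 200, 0.7, 0.05
reps = 100_000

# theta_tilde is the average of i_ja independent Bernoulli responses.
theta_tilde = rng.binomial(n=i_ja, p=theta_bar, size=reps) / i_ja
empirical_tail = np.mean(theta_tilde >= theta_bar + t)
hoeffding_bound = np.exp(-2 * i_ja * t**2)

print(f"empirical tail probability: {empirical_tail:.4f}")
print(f"Hoeffding bound:            {hoeffding_bound:.4f}")
```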

Notably, for a fixed $\mathcal{A}$ , each ${\tilde{\unicode{x3b8}}}_{ja}$ can take values only in the finite set $\left\{0,1/{i}_{ja},2/{i}_{ja},\dots, 1\right\}$ of cardinality ${i}_{ja}+1$ . We denote this range of ${\tilde{\unicode{x3b8}}}_{ja}$ by ${\tilde{\Theta}}^{ja}$ and, with a slight abuse of notation, the range of the matrix $\tilde{\Theta}=\left({\tilde{\unicode{x3b8}}}_{ja}\right)$ also by $\tilde{\Theta}$ . Then $P\left({\tilde{\unicode{x3b8}}}_{ja}=v\right)\le \exp \left(-2{i}_{ja}{\left(v-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}\right)}^2\right)$ for any $v\in {\tilde{\Theta}}^{ja}$ . As each of the at most $J\times {2}^K$ entries ${\tilde{\unicode{x3b8}}}_{ja}$ of $\tilde{\Theta}$ can independently take ${i}_{ja}+1$ different values, we have ${\mid\tilde{\Theta}\mid ={\prod}_j\kern0.1em {\prod}_{a=1}^{L_j}\kern0.1em \left({i}_{ja}+1\right)}$ under the constraint ${\sum}_{a=1}^{L_j}\kern0.1em {i}_{ja}=I$ . As ${L}_j={2}^{K_j}\le {2}^K$ , we have ${\prod}_{a=1}^{L_j}\kern0.1em \left({i}_{ja}+1\right)\le {\left(1+I/{2}^K\right)}^{2^K}$ . Denote ${\tilde{\Theta}}_{\epsilon}=\left\{\grave{\Theta}\in \tilde{\Theta}:{\sum}_j\kern0.1em {\sum}_a\kern0.1em {i}_{ja}{\left({\grave{\unicode{x3b8}}}_{ja}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}\right)}^2\ge \epsilon \right\}\subseteq \tilde{\Theta}$ ; then

(A.9) $$\begin{align}\displaystyle & P\left(\sum \limits_j\kern0.20em \sum \limits_{a=1}^{L_j}\kern0.20em {i}_{ja}{\left({\tilde{\unicode{x3b8}}}_{ja}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}\right)}^2\ge \epsilon \right)=\sum \limits_{\grave{\Theta}\in {\tilde{\Theta}}_{\epsilon }}\kern0.20em P\left(\tilde{\Theta}=\grave{\Theta}\right)\nonumber\\&\quad{}\le \;\sum \limits_{\grave{\Theta}\in {\tilde{\Theta}}_{\epsilon }}\kern0.20em \prod \limits_j\kern0.20em \prod \limits_a\kern0.20em \exp \left(-2{i}_{ja}{\left({\grave{\unicode{x3b8}}}_{ja}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}\right)}^2\right)\nonumber\\&\quad{}=\;\sum \limits_{\grave{\Theta}\in {\tilde{\Theta}}_{\epsilon }}\kern0.20em \exp \left(-2\sum \limits_j\kern0.20em \sum \limits_a\kern0.20em {i}_{ja}{\left({\grave{\unicode{x3b8}}}_{ja}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}\right)}^2\right)\\&\le \sum \limits_{\grave{\Theta}\in {\tilde{\Theta}}_{\epsilon }}\kern0.20em \exp \left(-2\epsilon \right)\le \left|\tilde{\Theta}\right|{e}^{-2\epsilon} \leq \left(\frac{I}{2^K}+1\right)^{J2^K}e^{-2\epsilon}.\nonumber\end{align}$$

The above result holds for any fixed $\mathcal{A}$ ; applying a union bound over all the ${\left({2}^K\right)}^I$ possible assignments of $\mathcal{A}$ , we obtain

(A.10) $$\begin{align}P\left(\underset{\mathcal{A}}{\max}\kern0.1em \left\{\sum \limits_j\kern0.20em \sum \limits_a\kern0.20em {i}_{ja}{\left({\tilde{\unicode{x3b8}}}_{ja}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}\right)}^2\right\}\ge \unicode{x3b5} \right)\le {2}^{KI}{\left(\frac{I}{2^K}+1\right)}^{J{2}^K}{e}^{-2\unicode{x3b5}}.\end{align}$$

Take $\unicode{x3b4} ={2}^{KI}{\left(\frac{I}{2^K}+1\right)}^{J{2}^K}{e}^{-2\unicode{x3b5}}$; then $2\epsilon =I\log {2}^K+J{2}^K\log \left(1+I/{2}^K\right)-\log \unicode{x3b4}$. This concludes the proof of Lemma 2.

Lemma 3. Define the random variable $Y={\sum}_i\kern0.1em {\sum}_j\kern0.1em \ell \left({X}_{ij},{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}^{\left(\mathcal{A}\right)}\right)$ , and denote ${Y}_{ij}=\ell \left({X}_{ij},{\overset{\leftharpoonup }{\theta}}_{j,{z}_i}\right)$ . Note that ${\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}\in \left[{\unicode{x3b4}}_2,1-{\unicode{x3b4}}_2\right]$ and that $\ell \left(\cdotp, \cdotp \right)$ is continuous in $\theta$ on $\left(0,1\right)$ . Since a continuous function on a compact set is bounded, there exists a constant $U>0$ such that $\left|\ell \left({X}_{ij},{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}\right)\right|\le U$ for all $\left(i,j\right)$ . Applying Hoeffding's inequality to bound $\mid Y-\mathbb{E}\left[Y\right]\mid$ for any realization of $\mathcal{A}$ , we have:

(A.11) $$\begin{align}P\left(|Y-\mathbb{E}\left[Y\right]|\ge \unicode{x3b5} \right)\le 2\exp \left\{-\frac{\epsilon^2}{\left(4{U}^2\right) IJ}\right\}.\end{align}$$

With the help of Lemmas 2 and 3, we now prove the following proposition:

Proposition 1. Under the following scaling for some small positive constant $c>0$ ,

$$\begin{align*}\sqrt{J}=O\left({I}^{1-c}\right)\end{align*}$$

we have ${\mathit{max}}_{\mathcal{A}}\kern0.1em \mid \overline{\mathcal{L}}\left(\mathcal{A}\right)-\mathcal{L}\left(\mathcal{A}\right)\mid ={o}_p\left({\unicode{x3b4}}_{IJ}\right)$, where ${\unicode{x3b4}}_{IJ}=I\sqrt{J}{\left(\log J\right)}^{\tilde{\unicode{x3b5}}}$ for some small constant $\tilde{\unicode{x3b5}}>0$.

Proof. First, note that under the given scaling condition, $J=o\left(I\sqrt{J}\right)$ . Combining the results of Lemmas 2 and 3 and applying the union bound over the ${\left({2}^K\right)}^I$ possible assignments of $\mathcal{A}$ , we obtain

(A.12) $$\begin{align}P\left(\underset{\mathcal{A}}{\max}\kern0.1em \left|\overline{\mathcal{L}}\left(\mathcal{A}\right)-\mathcal{L}\left(\mathcal{A}\right)\right|\ge 3\unicode{x3b5} {\unicode{x3b4}}_{IJ}\right)&\le {\left({2}^K\right)}^IP\left[\left\{\sum \limits_j\kern0.20em \sum \limits_a\kern0.20em {i}_{ja}{\left({\tilde{\unicode{x3b8}}}_{ja}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{ja}\right)}^2\ge \unicode{x3b5} {\unicode{x3b4}}_{IJ}\right\}\cup \left\{\left|Y-\mathbb{E}\left[Y\right]\right|\ge \unicode{x3b5} {\unicode{x3b4}}_{IJ}\right\}\right]\nonumber\\&\quad+P\left(\underset{\mathcal{A}}{\max}\kern0.1em \sum \limits_i\kern0.20em \sum \limits_j\kern0.20em \left(\ell \left({X}_{ij},{\widehat{\unicode{x3b8}}}_{j,{z}_i}\right)-\ell \left({X}_{ij},{\tilde{\unicode{x3b8}}}_{j,{z}_i}\right)\right)\ge J{\left(\log J\right)}^{\tilde{\unicode{x3b5}}}\right)\nonumber\\&\le \exp \left(I\log \left({2}^K\right)+J{2}^K\log \left(\frac{I}{2^K}+1\right)-2\unicode{x3b5} {\unicode{x3b4}}_{IJ}\right)\nonumber\\&\quad+2\exp \left(I\log \left({2}^K\right)-\frac{\unicode{x3b5}^2{\delta}_{IJ}^2}{4{U}^2 IJ}\right)\nonumber\\&\quad+P\left(\underset{\mathcal{A}}{\max}\kern0.1em \sum \limits_i\kern0.20em \sum \limits_j\kern0.20em \left(\ell \left({X}_{ij},{\widehat{\unicode{x3b8}}}_{j,{z}_i}\right)-\ell \left({X}_{ij},{\tilde{\unicode{x3b8}}}_{j,{z}_i}\right)\right)\ge J{\left(\log J\right)}^{\tilde{\unicode{x3b5}}}\right).\end{align}$$

For the third term, note that the following inequality holds for any given $\mathcal{A}$ :

$$\begin{align*}\sum \limits_i\kern0.1em \sum \limits_j\kern0.1em \left(\ell \left({X}_{ij},{\hat{\unicode{x3b8}}}_{j,{z}_i}\right)-\ell \left({X}_{ij},{\tilde{\unicode{x3b8}}}_{j,{z}_i}\right)\right)\le \left({b}_U\tilde{C}{2}^K\right)J.\end{align*}$$

Hence, the third term vanishes for all sufficiently large $J$ , because the deterministic bound $\left({b}_U\tilde{C}{2}^K\right)J$ is eventually smaller than $J{\left(\log J\right)}^{\tilde{\unicode{x3b5}}}$ . For the second term on the right-hand side of (A.12) to converge to zero, we set ${\unicode{x3b4}}_{IJ}=I\sqrt{J}{\left(\log J\right)}^{\tilde{\epsilon }}$ for a small positive constant $\tilde{\unicode{x3b5}}$ . Under this choice of ${\unicode{x3b4}}_{IJ}$ , the first term also converges to zero as $I,J$ increase, given the scaling $\sqrt{J}=O\left({I}^{1-c}\right)$ stated in the proposition. Therefore, $P\left({\max}_{\mathcal{A}}\kern0.1em |\overline{\mathcal{L}}\left(\mathcal{A}\right)-\mathcal{L}\left(\mathcal{A}\right)|\ge 3\unicode{x3b5} {\unicode{x3b4}}_{IJ}\right)=o(1)$ , which implies the result of Proposition 1.

Step 3. (A.5) implies that ${\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}^{\left({\mathcal{A}}^0\right)}={P}_{ij}$ ; that is, if we plug in the true latent attribute mastery patterns ${\mathcal{A}}^0$ , the estimators equal the corresponding true parameters. Using this property, the following lemma shows that ${\mathcal{A}}^0$ minimizes the expected loss.

Lemma 4. By Assumption 4, $\mathbb{E}\left[\ell \left({X}_{ij},{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}\right)\right]-\mathbb{E}\left[\ell \left({X}_{ij},{\unicode{x3b8}}_{j,{z}_i^0}^0\right)\right]\ge c{\left({\overset{\leftharpoonup }{\theta}}_{j,{z}_i}-{\unicode{x3b8}}_{j,{z}_i^0}^0\right)}^{\unicode{x3b7}}$ for some $\unicode{x3b7} \ge 2$ and $c>0$ ; then we have

(A.13) $$\begin{align}\overline{\mathcal{L}}\left(\mathcal{A}\right)-\overline{\mathcal{L}}\left({\mathcal{A}}^0\right)\ge c\cdotp \left(\sum \limits_i\kern0.20em \sum \limits_j\kern0.20em {\left({P}_{ij}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}\right)}^{\unicode{x3b7}}\right)\ge 0.\end{align}$$

Since Lemma 4 holds for any $\mathcal{A}$ , it holds in particular for the estimator $\widehat{\mathcal{A}}$ ; thus,

(A.14) $$\begin{align}0\le \overline{\mathcal{L}}\left(\widehat{\mathcal{A}}\right)-\overline{\mathcal{L}}\left({\mathcal{A}}^0\right)=\left[\overline{\mathcal{L}}\left(\widehat{\mathcal{A}}\right)-\mathcal{L}\left(\widehat{\mathcal{A}}\right)\right]+\left[\mathcal{L}\left(\widehat{\mathcal{A}}\right)-\mathcal{L}\left({\mathcal{A}}^0\right)\right]+\left[\mathcal{L}\left({\mathcal{A}}^0\right)-\overline{\mathcal{L}}\left({\mathcal{A}}^0\right)\right].\end{align}$$

As $\widehat{\mathcal{A}}=\arg {\min}_{\mathcal{A}}\kern0.1em \mathcal{L}\left(\mathcal{A}\right)$ , we have $\mathcal{L}\left(\widehat{\mathcal{A}}\right)-\mathcal{L}\left({\mathcal{A}}^0\right)\le 0$ . Substituting this into (A.14), we obtain

$$\begin{align*}0\le \overline{\mathcal{L}}\left(\widehat{\mathcal{A}}\right)-\overline{\mathcal{L}}\left({\mathcal{A}}^0\right)\le 2\underset{\mathcal{A}}{\sup}\kern0.1em \mid \overline{\mathcal{L}}\left(\mathcal{A}\right)-\mathcal{L}\left(\mathcal{A}\right)\mid ={o}_p\left({\unicode{x3b4}}_{IJ}\right).\end{align*}$$

A.5 Second Half of the Proof of Theorem 1

By applying Hölder’s inequality, we have

$$\begin{align*}{(IJ)}^{1-\frac{\unicode{x3b7}}{2}}{\left(\sum \limits_i\kern0.20em \sum \limits_j\kern0.20em {\left({P}_{ij}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}\right)}^2\right)}^{\frac{\unicode{x3b7}}{2}}\le \sum \limits_i\kern0.1em \sum \limits_j\kern0.1em {\left({P}_{ij}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}\right)}^{\unicode{x3b7}}={o}_p\left({\unicode{x3b4}}_{IJ}\right).\end{align*}$$
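For clarity, Hölder's inequality is used here with exponents $\unicode{x3b7} /2$ and $\unicode{x3b7} /\left(\unicode{x3b7} -2\right)$ (the case $\unicode{x3b7} =2$ is an identity), applied to the terms ${\left({P}_{ij}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}\right)}^2$ :

$$\begin{align*}\sum \limits_i\kern0.1em \sum \limits_j\kern0.1em {\left({P}_{ij}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}\right)}^2\le {(IJ)}^{1-\frac{2}{\unicode{x3b7}}}{\left(\sum \limits_i\kern0.1em \sum \limits_j\kern0.1em {\left({P}_{ij}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}\right)}^{\unicode{x3b7}}\right)}^{\frac{2}{\unicode{x3b7}}};\end{align*}$$

raising both sides to the power $\unicode{x3b7} /2$ and multiplying by ${(IJ)}^{1-\unicode{x3b7} /2}$ yields the display above.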

By letting ${(IJ)}^{1-\eta /2}{(S)}^{\unicode{x3b7} /2}={\unicode{x3b4}}_{IJ}$ , we can check that ${\sum}_i\kern0.1em {\sum}_j\kern0.1em {\left({P}_{ij}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,\widehat{z_i}}\right)}^2={o}_p(S)$ where $S:= I{(J)}^{1-1/\eta }{\left(\log J\right)}^{2\tilde{\unicode{x3b5}}/\unicode{x3b7}}$ . In the following, we derive a lower bound for ${\sum}_i\kern0.1em {\sum}_j\kern0.1em {\left({P}_{ij}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}\right)}^2$ because it is easier to work with than ${\left({P}_{ij}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{z}_i}\right)}^{\unicode{x3b7}}$ .
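For completeness, the algebra behind this choice of $S$ is the following:

$$\begin{align*}{S}^{\unicode{x3b7} /2}={\unicode{x3b4}}_{IJ}{(IJ)}^{\unicode{x3b7} /2-1}\quad \Longrightarrow \quad S={\unicode{x3b4}}_{IJ}^{2/\unicode{x3b7}}{(IJ)}^{1-2/\unicode{x3b7}}={\left(I\sqrt{J}{\left(\log J\right)}^{\tilde{\unicode{x3b5}}}\right)}^{2/\unicode{x3b7}}{(IJ)}^{1-2/\unicode{x3b7}}=I{J}^{1-1/\unicode{x3b7}}{\left(\log J\right)}^{2\tilde{\unicode{x3b5}}/\unicode{x3b7}}.\end{align*}$$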

Step 4. Motivated by Assumption 2, we define $\mathcal{J}:= \left\{j\in \left[J\right];\exists k\in \left[K\right]\right.$ s.t. $\left.{\boldsymbol{q}}_j^0={\boldsymbol{e}}_k\right\}$ , the set of all items $j$ that depend on only one latent attribute. Notably, $\forall j\in \mathcal{J},\left|\left\{\boldsymbol{\unicode{x3b1}} \circ {\boldsymbol{q}}_j^0;\boldsymbol{\unicode{x3b1}} \in {\left\{0,1\right\}}^K\right\}\right|=2$ , because ${\boldsymbol{q}}_j^0$ contains exactly one required latent attribute; hence $\unicode{x3be} \left({\boldsymbol{q}}_j^0,\boldsymbol{\unicode{x3b1}} \right)\in \left\{1,2\right\}$ for all $j\in \mathcal{J}$ . Without loss of generality, let $\unicode{x3be} \left({\boldsymbol{q}}_j^0,\boldsymbol{\unicode{x3b1}} \right)=2$ if $\boldsymbol{\unicode{x3b1}} \circ {\boldsymbol{q}}_j^0\ne \mathbf{0}$ and $\unicode{x3be} \left({\boldsymbol{q}}_j^0,\boldsymbol{\unicode{x3b1}} \right)=1$ otherwise. Further, we assume that ${\unicode{x3b8}}_{j,2}^0>{\unicode{x3b8}}_{j,1}^0,\forall j\in \mathcal{J}$ , which reflects the notion that subjects possessing the required latent attribute tend to perform better. For any $j\in \mathcal{J}$ , define

(A.15) $$\begin{align}{I}_{a,b}^j:= \sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{z}_{ij}^0=a\right\}\mathcal{I}\left\{{\widehat{z}}_{ij}=b\right\},\left(a,b\right)\in {\left\{1,2\right\}}^2.\end{align}$$

Note ${P}_{ij}=\mathcal{I}\left\{{z}_{ij}^0=2\right\}{\unicode{x3b8}}_{j,2}^0+\mathcal{I}\left\{{z}_{ij}^0=1\right\}{\unicode{x3b8}}_{j,1}^0$ and ${I}_{2,2}^j+{I}_{1,2}^j={\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{\widehat{z}}_{ij}=2\right\}$ , ${I}_{2,1}^j+{I}_{1,1}^j={\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{\widehat{z}}_{ij}=1\right\}$ . By using (A.5), we have

(A.16) $$\begin{align}{\overset{\leftharpoonup }{\theta}}_{j,2}^{\left(\widehat{\mathcal{A}}\right)}&=\frac{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{\widehat{z}}_{ij}=2\right\}{P}_{ij}}{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{\widehat{z}}_{ij}=2\right\}}=\frac{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{\widehat{z}}_{ij}=2\right\}\left(\mathcal{I}\left\{{z}_{ij}^0=2\right\}{\unicode{x3b8}}_{j,2}^0+\mathcal{I}\left\{{z}_{ij}^0=1\right\}{\unicode{x3b8}}_{j,1}^0\right)}{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{\widehat{z}}_{ij}=2\right\}}\nonumber\\&=\frac{I_{2,2}^j{\unicode{x3b8}}_{j,2}^0+{I}_{1,2}^j{\unicode{x3b8}}_{j,1}^0}{I_{2,2}^j+{I}_{1,2}^j};{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,1}^{\left(\widehat{\mathcal{A}}\right)}=\frac{I_{2,1}^j{\unicode{x3b8}}_{j,2}^0+{I}_{1,1}^j{\unicode{x3b8}}_{j,1}^0}{I_{2,1}^j+{I}_{1,1}^j}.\end{align}$$
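As a small numerical illustration (not part of the proof), the following sketch computes the collapsed parameters in (A.16) for one single-attribute item from a hypothetical confusion table of true versus estimated local classes, and checks the order condition established in (A.17) below. All counts and parameter values are made up for this example.

```python
# Numerical sketch of (A.16)-(A.17) for a single-attribute item j.
# counts[(a, b)] = #{i : z0_ij = a, zhat_ij = b}; theta0 holds the true
# item parameters. All values are hypothetical, for illustration only.
counts = {(1, 1): 80, (1, 2): 20,   # true local class a = 1 (attribute absent)
          (2, 1): 15, (2, 2): 85}   # true local class a = 2 (attribute present)
theta0 = {1: 0.2, 2: 0.8}           # theta0_{j,1} < theta0_{j,2}

# Collapsed parameters under the estimated assignment, as in (A.16).
theta_bar = {
    b: (counts[(2, b)] * theta0[2] + counts[(1, b)] * theta0[1])
       / (counts[(2, b)] + counts[(1, b)])
    for b in (1, 2)
}
print(theta_bar)  # roughly {1: 0.29, 2: 0.69} for these counts

# Order condition of (A.17): theta_bar_{j,2} > theta_bar_{j,1}
# if and only if I_{2,2} * I_{1,1} > I_{2,1} * I_{1,2}.
print(theta_bar[2] > theta_bar[1])
print(counts[(2, 2)] * counts[(1, 1)] > counts[(2, 1)] * counts[(1, 2)])
```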

Under $\widehat{\mathcal{A}}$ , we impose the natural constraint ${\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,2}^{\left(\widehat{\mathcal{A}}\right)}>{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,1}^{\left(\widehat{\mathcal{A}}\right)},\forall j\in \mathcal{J}$ on $\widehat{\mathcal{A}}$ for identifiability purposes. This constraint does not change the previous results: since ${\unicode{x3b8}}_{j,2}^0>{\unicode{x3b8}}_{j,1}^0$ , the true pattern ${\mathcal{A}}^0$ satisfies this constraint, so $\mathcal{L}\left(\widehat{\mathcal{A}}\right)-\mathcal{L}\left({\mathcal{A}}^0\right)\le 0$ in (A.14) still holds, and thus $\overline{\mathcal{L}}\left(\widehat{\mathcal{A}}\right)-\overline{\mathcal{L}}\left({\mathcal{A}}^0\right)={o}_p\left({\unicode{x3b4}}_{IJ}\right)$ continues to hold under this constraint. Combining ${\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,2}^{\left(\widehat{\mathcal{A}}\right)}>{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,1}^{\left(\widehat{\mathcal{A}}\right)}$ and ${\unicode{x3b8}}_{j,2}^0>{\unicode{x3b8}}_{j,1}^0$ , we have

(A.17) $$\begin{align}{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,2}^{\left(\widehat{\mathcal{A}}\right)}>{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,1}^{\left(\widehat{\mathcal{A}}\right)}\Longleftrightarrow \left({I}_{2,2}^j{I}_{1,1}^j-{I}_{2,1}^j{I}_{1,2}^j\right){\unicode{x3b8}}_{j,2}^0>\left({I}_{2,2}^j{I}_{1,1}^j-{I}_{2,1}^j{I}_{1,2}^j\right){\unicode{x3b8}}_{j,1}^0\Longleftrightarrow {I}_{2,2}^j{I}_{1,1}^j>{I}_{2,1}^j{I}_{1,2}^j.\end{align}$$
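For completeness, the equivalence in (A.17) can be seen by writing the difference of the two collapsed parameters in (A.16) over a common denominator:

$$\begin{align*}{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,2}^{\left(\widehat{\mathcal{A}}\right)}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,1}^{\left(\widehat{\mathcal{A}}\right)}=\frac{\left({I}_{2,2}^j{I}_{1,1}^j-{I}_{2,1}^j{I}_{1,2}^j\right)\left({\unicode{x3b8}}_{j,2}^0-{\unicode{x3b8}}_{j,1}^0\right)}{\left({I}_{2,2}^j+{I}_{1,2}^j\right)\left({I}_{2,1}^j+{I}_{1,1}^j\right)},\end{align*}$$

which is positive exactly when ${I}_{2,2}^j{I}_{1,1}^j>{I}_{2,1}^j{I}_{1,2}^j$ , because ${\unicode{x3b8}}_{j,2}^0>{\unicode{x3b8}}_{j,1}^0$ .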

From (A.16), we obtain

$$\begin{align*}\left|{\unicode{x3b8}}_{j,1}^0-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,1}^{\left(\widehat{\mathcal{A}}\right)}\right|=\frac{I_{2,1}^j\left({\unicode{x3b8}}_{j,2}^0-{\unicode{x3b8}}_{j,1}^0\right)}{I_{2,1}^j+{I}_{1,1}^j},\left|{\unicode{x3b8}}_{j,2}^0-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,2}^{\left(\widehat{\mathcal{A}}\right)}\right|=\frac{I_{1,2}^j\left({\unicode{x3b8}}_{j,2}^0-{\unicode{x3b8}}_{j,1}^0\right)}{I_{2,2}^j+{I}_{1,2}^j},\end{align*}$$
$$\begin{align*}\left|{\unicode{x3b8}}_{j,2}^0-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,1}^{\left(\widehat{\mathcal{A}}\right)}\right|=\frac{I_{1,1}^j\left({\unicode{x3b8}}_{j,2}^0-{\unicode{x3b8}}_{j,1}^0\right)}{I_{2,1}^j+{I}_{1,1}^j},\left|{\theta}_{j,1}^0-{\overset{\leftharpoonup }{\theta}}_{j,2}^{\left(\widehat{\mathcal{A}}\right)}\right|=\frac{I_{2,2}^j\left({\unicode{x3b8}}_{j,2}^0-{\unicode{x3b8}}_{j,1}^0\right)}{I_{2,2}^j+{I}_{1,2}^j}.\end{align*}$$

Therefore,

(A.18) $$\begin{align}&\;\sum \limits_{j=1}^J\kern0.20em \sum \limits_{i=1}^I\kern0.20em {\left({P}_{ij}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,\widehat{z_i}}\right)}^2\ge \sum \limits_{j\in \mathcal{J}}\kern0.20em \sum \limits_{i=1}^I\kern0.20em {\left({P}_{ij}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,\widehat{z_i}}\right)}^2\nonumber\\ {}&\quad=\;\sum \limits_{j\in \mathcal{J}}\kern0.20em \left({I}_{1,1}^j{\left({\unicode{x3b8}}_{j,1}^0-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,1}^{\left(\widehat{\mathcal{A}}\right)}\right)}^2+{I}_{2,1}^j{\left({\unicode{x3b8}}_{j,2}^0-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,1}^{\left(\widehat{\mathcal{A}}\right)}\right)}^2+{I}_{1,2}^j{\left({\unicode{x3b8}}_{j,1}^0-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,2}^{\left(\widehat{\mathcal{A}}\right)}\right)}^2+{I}_{2,2}^j{\left({\unicode{x3b8}}_{j,2}^0-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,2}^{\left(\widehat{\mathcal{A}}\right)}\right)}^2\right)\nonumber\\ {}&\quad=\;\sum \limits_{j\in \mathcal{J}}\kern0.20em \left(\frac{I_{1,1}^j{\left({I}_{2,1}^j\right)}^2+{I}_{2,1}^j{\left({I}_{1,1}^j\right)}^2}{{\left({I}_{2,1}^j+{I}_{1,1}^j\right)}^2}+\frac{I_{1,2}^j{\left({I}_{2,2}^j\right)}^2+{I}_{2,2}^j{\left({I}_{1,2}^j\right)}^2}{{\left({I}_{2,2}^j+{I}_{1,2}^j\right)}^2}\right){\left({\unicode{x3b8}}_{j,2}^0-{\unicode{x3b8}}_{j,1}^0\right)}^2\nonumber\\ {}&\quad=\;\sum \limits_{j\in \mathcal{J}}\kern0.20em \left(\frac{I_{2,1}^j{I}_{1,1}^j}{I_{2,1}^j+{I}_{1,1}^j}+\frac{I_{2,2}^j{I}_{1,2}^j}{I_{2,2}^j+{I}_{1,2}^j}\right){\left({\unicode{x3b8}}_{j,2}^0-{\unicode{x3b8}}_{j,1}^0\right)}^2\nonumber\\ {}&\quad\ge \unicode{x3b4} \sum \limits_{j\in \mathcal{J}}\kern0.20em \left(\frac{I_{2,1}^j{I}_{1,1}^j}{I_{2,1}^j+{I}_{1,1}^j}+\frac{I_{2,2}^j{I}_{1,2}^j}{I_{2,2}^j+{I}_{1,2}^j}\right)\nonumber\\&\quad\ge \frac{1}{2}\unicode{x3b4} \sum \limits_{j\in \mathcal{J}}\kern0.20em \left(\min \left\{{I}_{2,1}^j,{I}_{1,1}^j\right\}+\min \left\{{I}_{2,2}^j,{I}_{1,2}^j\right\}\right).\end{align}$$

The second inequality holds since, by Assumption 1, ${\left({\unicode{x3b8}}_{j,2}^0-{\unicode{x3b8}}_{j,1}^0\right)}^2\ge \unicode{x3b4}$ , and the last inequality uses $\frac{ab}{a+b}\ge \frac{1}{2}\min \left\{a,b\right\}$ for nonnegative $a,b$ . One ideal scenario is that, for most $j\in \mathcal{J}$ , $\min \left\{{I}_{2,1}^j,{I}_{1,1}^j\right\}+\min \left\{{I}_{2,2}^j,{I}_{1,2}^j\right\}={I}_{2,1}^j+{I}_{1,2}^j={\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{z}_{ij}^0\ne {\widehat{z}}_{ij}\right\}$ ; in that case, the misclassification error for the local latent classes can be bounded relatively tightly. The following result confirms this intuition.

Lemma 5. Define the following random sets, which depend on the estimated latent attribute mastery patterns $\hat{\mathcal{A}}$ under the constraint ${\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,2}^{\left(\mathcal{A}\right)}>{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,1}^{\left(\mathcal{A}\right)},\forall j\in \mathcal{J}$ :

$$\begin{align*}{\mathcal{J}}_0=\left\{j\in \mathcal{J};{I}_{2,1}^j<{I}_{1,1}^j,{I}_{1,2}^j<{I}_{2,2}^j\right\};\end{align*}$$
$$\begin{align*}{\mathcal{J}}_1=\left\{j\in \mathcal{J};{I}_{2,1}^j<{I}_{1,1}^j,{I}_{1,2}^j>{I}_{2,2}^j\right\};\end{align*}$$
$$\begin{align*}{\mathcal{J}}_2=\left\{j\in \mathcal{J};{I}_{2,1}^j>{I}_{1,1}^j,{I}_{1,2}^j<{I}_{2,2}^j\right\},\end{align*}$$

then, under Assumptions 1 and 2, we have $\left|{\mathcal{J}}_1\right|={o}_p\left(S/I\right)$ and $\left|{\mathcal{J}}_2\right|={o}_p\left(S/I\right)$ .

Proof. If $j\in {\mathcal{J}}_1$ , then $\min \left\{{I}_{2,1}^j,{I}_{1,1}^j\right\}+\min \left\{{I}_{2,2}^j,{I}_{1,2}^j\right\}={I}_{2,1}^j+{I}_{2,2}^j={\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{z}_{ij}^0=2\right\}$ . Under Assumption 2,

$$\begin{align*}\sum \limits_{i=1}^I\kern0.1em \mathcal{I}\left\{{z}_{ij}^0=2\right\}\ge I\epsilon\end{align*}$$

then

$$\begin{align*}& P\left(\left|{\mathcal{J}}_1\right|\ge \frac{S}{\unicode{x3b4} I}\right)\\ {}&\quad \le P\left(\sum \limits_{j\in {\mathcal{J}}_1}\kern0.20em {I}_{2,1}^j+{I}_{2,2}^j\ge \frac{S}{\unicode{x3b4} I}\cdotp I\unicode{x3b5} \right)\\{}&\quad \le P\left(\sum \limits_i\kern0.20em \sum \limits_j\kern0.20em {\left({P}_{ij}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{\widehat{z}}_i}\right)}^2\ge \frac{\unicode{x3b5} S}{2}\right).\end{align*}$$

Here, the second inequality follows from (A.18), which gives ${\sum}_i\kern0.1em {\sum}_j\kern0.1em {\left({P}_{ij}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,{\widehat{z}}_i}\right)}^2\ge \frac{\unicode{x3b4}}{2}{\sum}_{j\in {\mathcal{J}}_1}\kern0.1em \left({I}_{2,1}^j+{I}_{2,2}^j\right)$ . Since ${\sum}_i\kern0.1em {\sum}_j\kern0.1em {\left({P}_{ij}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,\widehat{z_i}}\right)}^2={o}_p(S)$ , it follows that $\left|{\mathcal{J}}_1\right|={o}_p\left(S/I\right)$ . Similar arguments yield $\left|{\mathcal{J}}_2\right|={o}_p\left(S/I\right)$ , which concludes the proof of Lemma 5.

Note that (A.17) implies $\min \left\{{I}_{2,1}^j,{I}_{1,1}^j\right\}+\min \left\{{I}_{2,2}^j,{I}_{1,2}^j\right\}\ne {I}_{1,1}^j+{I}_{2,2}^j$ for all $j\in \mathcal{J}$ ; thus $\mathcal{J}={\mathcal{J}}_0\cup {\mathcal{J}}_1\cup {\mathcal{J}}_2$ . Lemma 5 implies that, when ${\unicode{x3b4}}_J$ goes to 0 at a mild rate, the number of elements in ${\mathcal{J}}_0$ dominates that in ${\mathcal{J}}_1\cup {\mathcal{J}}_2$ . Hence, for most $j\in \mathcal{J}$ , $\min \left\{{I}_{2,1}^j,{I}_{1,1}^j\right\}+\min \left\{{I}_{2,2}^j,{I}_{1,2}^j\right\}$ equals ${I}_{2,1}^j+{I}_{1,2}^j={\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{z}_{ij}^0\ne {\widehat{z}}_{ij}\right\}$ , the number of subjects assigned to incorrect local latent classes.

Step 5. (A.18) implies that ${\sum}_i\kern0.1em {\sum}_j\kern0.1em {\left({P}_{ij}-{\overset{\leftharpoonup }{\unicode{x3b8}}}_{j,\widehat{z_i}}\right)}^2\ge \frac{\unicode{x3b4}}{2}{\sum}_{j\in {\mathcal{J}}_0}\kern0.1em {\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{z}_{ij}^0\ne {\widehat{z}}_{ij}\right\}$ . Next, we derive a lower bound for ${\sum}_{j\in {\mathcal{J}}_0}\kern0.1em {\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{z}_{ij}^0\ne {\widehat{z}}_{ij}\right\}$ in terms of the classification error rate ${I}^{-1}{\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{\boldsymbol{\unicode{x3b1}}}_i^0\ne {\widehat{\boldsymbol{\unicode{x3b1}}}}_i\right\}$ , which allows us to control the latter.

Motivated by Assumption 2, for each latent attribute $k$ , denote ${j}_k^1$ as the smallest index $j\in {\mathcal{J}}_0$ such that ${\boldsymbol{q}}_j^0={\boldsymbol{e}}_k$ , denote ${j}_k^2$ as the second smallest such index, and so on. For a positive integer $m$ , denote

(A.19) $$\begin{align}{\mathcal{B}}^m=\left\{{j}_1^m,\dots, {j}_K^m\right\}.\end{align}$$

Further, denote

(A.20) $$\begin{align}{J}_{\mathrm{min}}=\underset{1\le k\le K}{\min}\kern0.1em \left|\left\{j\in {\mathcal{J}}_0;{\boldsymbol{q}}_j^0={\boldsymbol{e}}_k\right\}\right|,{\tilde{J}}_{\mathrm{min}}=\underset{1\le k\le K}{\min}\kern0.1em \left|\left\{j\in \mathcal{J};{\boldsymbol{q}}_j^0={\boldsymbol{e}}_k\right\}\right|.\end{align}$$

Then, we have ${\mathcal{B}}^m\cap {\mathcal{B}}^l=\varnothing$ for any $m\ne l$ ; thus

(A.21) $$\begin{align}\displaystyle &\;\sum \limits_{i=1}^I\kern0.20em \sum \limits_{j\in {\mathcal{J}}_0}\kern0.20em \mathcal{I}\left\{\unicode{x3be} \left({\boldsymbol{q}}_j^0,{\boldsymbol{\alpha}}_i^0\right)\ne \unicode{x3be} \left({\boldsymbol{q}}_j^0,{\widehat{\boldsymbol{\alpha}}}_i\right)\right\}\nonumber\\ {}&\quad\ge \;\sum \limits_{i=1}^I\kern0.20em \sum \limits_{m=1}^{J_{\mathrm{min}}}\kern0.20em \sum \limits_{j\in {\mathcal{B}}^m}\kern0.20em \mathcal{I}\left\{\unicode{x3be} \left({\boldsymbol{q}}_j^0,{\boldsymbol{\alpha}}_i^0\right)\ne \unicode{x3be} \left({\boldsymbol{q}}_j^0,{\widehat{\boldsymbol{\alpha}}}_i\right)\right\}\nonumber\\ {}&\quad= {J}_{\mathrm{min}}\sum \limits_{i=1}^I\kern0.20em \sum \limits_{k=1}^K\kern0.20em \mathcal{I}\left\{\unicode{x3be} \left({\boldsymbol{e}}_k,{\boldsymbol{\alpha}}_i^0\right)\ne \unicode{x3be} \left({\boldsymbol{e}}_k,{\widehat{\boldsymbol{\alpha}}}_i\right)\right\}.\end{align}$$

The last equality holds because each ${\mathcal{B}}^m$ contains exactly one item with $\boldsymbol{q}$ -vector ${\boldsymbol{e}}_k$ for every $k\in \left[K\right]$ . Since ${\sum}_{k=1}^K\kern0.1em \mathcal{I}\left\{\unicode{x3be} \left({\boldsymbol{e}}_k,{\boldsymbol{\unicode{x3b1}}}_i^0\right)\ne \xi \left({\boldsymbol{e}}_k,{\widehat{\boldsymbol{\unicode{x3b1}}}}_i\right)\right\}\ge \mathcal{I}\left\{{\boldsymbol{\unicode{x3b1}}}_i^0\ne {\widehat{\boldsymbol{\unicode{x3b1}}}}_i\right\}$ and, by Step 5, ${\sum}_{j\in {\mathcal{J}}_0}\kern0.1em {\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{z}_{ij}^0\ne {\widehat{z}}_{ij}\right\}={o}_p(S)$ , (A.21) implies ${o}_p\left(S/I\right)\ge {J}_{\mathrm{min}}{I}^{-1}{\sum}_{i=1}^I \mathcal{I}\left\{{\boldsymbol{\unicode{x3b1}}}_i^0\ne {\widehat{\boldsymbol{\unicode{x3b1}}}}_i\right\}$ . For simplicity, let

$$\begin{align*}{\unicode{x3b3}}_J=\frac{S}{IJ}={J}^{-\frac{1}{\unicode{x3b7}}}{\left(\log J\right)}^{\frac{2\tilde{\unicode{x3b5}}}{\unicode{x3b7}}}.\end{align*}$$

Note that (32) in Assumption 2 implies that $\mid \mathcal{J}\mid /J\ge {\tilde{J}}_{\mathrm{min}}/J\ge {\delta}_J$ and ${J}_{\mathrm{min}}\ge {\tilde{J}}_{\mathrm{min}}-\left|{\mathcal{J}}_1\cup {\mathcal{J}}_2\right|$ . Plugging these results into ${o}_p\left(S/I\right)\ge {J}_{\mathrm{min}}{I}^{-1}{\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{\boldsymbol{\unicode{x3b1}}}_i^0\ne {\widehat{\boldsymbol{\unicode{x3b1}}}}_i\right\}$ and using ${I}^{-1}{\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{\boldsymbol{\unicode{x3b1}}}_i^0\ne {\widehat{\boldsymbol{\unicode{x3b1}}}}_i\right\}\le 1$ , we obtain

$$\begin{align*}{o}_p\left(\frac{S}{I}\right)+\left|{\mathcal{J}}_1\cup {\mathcal{J}}_2\right|\ge \frac{{\tilde{J}}_{\mathrm{min}}}{I}\sum \limits_{i=1}^I\kern0.1em \mathcal{I}\left\{{\boldsymbol{\unicode{x3b1}}}_i^0\ne {\widehat{\boldsymbol{\unicode{x3b1}}}}_i\right\}\ge \frac{J{\unicode{x3b4}}_J}{I}\sum \limits_{i=1}^I\kern0.1em \mathcal{I}\left\{{\boldsymbol{\unicode{x3b1}}}_i^0\ne {\widehat{\boldsymbol{\unicode{x3b1}}}}_i\right\}.\end{align*}$$

From Lemma 5, we have $\left|{\mathcal{J}}_i\right|={o}_p\left(S/I\right)$ for $i=1,2$ , which implies that $\left|{\mathcal{J}}_1\cup {\mathcal{J}}_2\right|={o}_p\left(S/I\right)$ . By substituting this into the above inequality, we can conclude that

$$\begin{align*}{o}_p\left(\frac{S}{I}\right)\ge \frac{J{\unicode{x3b4}}_J}{I}\sum \limits_{i=1}^I\kern0.1em \mathcal{I}\left\{{\boldsymbol{\unicode{x3b1}}}_i^0\ne {\widehat{\boldsymbol{\unicode{x3b1}}}}_i\right\},\end{align*}$$

which is equivalent to ${I}^{-1}{\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{\boldsymbol{\unicode{x3b1}}}_i^0\ne {\widehat{\boldsymbol{\unicode{x3b1}}}}_i\right\}={o}_p\left({\unicode{x3b3}}_J/{\unicode{x3b4}}_J\right)$ . This completes the proof of Theorem 1.

The inequality (32) in Assumption 2 thus bridges the misclassification error for the local latent classes and the misclassification error for the latent attribute mastery patterns $\widehat{\mathcal{A}}$ , via the inequality ${\sum}_{k=1}^K\kern0.1em \mathcal{I}\left\{\xi \left({\boldsymbol{e}}_k,{\boldsymbol{\unicode{x3b1}}}_i^0\right)\ne \xi \left({\boldsymbol{e}}_k,{\widehat{\boldsymbol{\unicode{x3b1}}}}_i\right)\right\}\ge \mathcal{I}\left\{{\boldsymbol{\unicode{x3b1}}}_i^0\ne {\widehat{\boldsymbol{\unicode{x3b1}}}}_i\right\}$ .

A.6 Proof of Theorem 2

For notational simplicity, denote ${i}_{ja}^0={\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{z}_{ij}^0=a\right\}$ . Then Assumption 2 implies that $\forall \boldsymbol{\unicode{x3b1}} \in {\left\{0,1\right\}}^K,{\sum}_{i=1}^I\kern0.1em \mathcal{I}\left\{{\boldsymbol{\unicode{x3b1}}}_i^0=\boldsymbol{\unicode{x3b1}} \right\}\ge \unicode{x3b5} I$ and, since exactly ${2}^K/{2}^{K_j}$ attribute patterns are mapped to each local class $a$ of item $j$ ,

(A.22) $$\begin{align}{i}_{ja}^0\ge \frac{2^K}{2^{K_j}}I\unicode{x3b5} \ge I\unicode{x3b5} .\end{align}$$

Recall that

$$\begin{align*}{\tilde{\unicode{x3b8}}}_{ja}=\frac{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{\widehat{z}}_{ij}=a\right\}{X}_{ij}}{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{\widehat{z}}_{ij}=a\right\}}.\end{align*}$$

We rewrite ${\unicode{x3b8}}_{ja}^0$ in a similar form:

$$\begin{align*}{\unicode{x3b8}}_{ja}^0=\frac{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{z}_{ij}^0=a\right\}{\unicode{x3b8}}_{ja}^0}{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{z}_{ij}^0=a\right\}}=\frac{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{z}_{ij}^0=a\right\}{P}_{ij}}{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{z}_{ij}^0=a\right\}}.\end{align*}$$

By the triangle inequality, we have

$$\begin{align*}\displaystyle &\;\underset{j,a}{\max}\kern0.1em \left|{\tilde{\unicode{x3b8}}}_{ja}-{\unicode{x3b8}}_{ja}^0\right|\\ {}&\quad=\;\underset{j,a}{\max}\kern0.1em \left|\frac{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{\widehat{z}}_{ij}=a\right\}{X}_{ij}}{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{\widehat{z}}_{ij}=a\right\}}-\frac{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{z}_{ij}^0=a\right\}{P}_{ij}}{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{z}_{ij}^0=a\right\}}\right|\\ {}&\quad\le \;\underset{j,a}{\max}\kern0.1em \left|\frac{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{\widehat{z}}_{ij}=a\right\}{X}_{ij}}{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{\widehat{z}}_{ij}=a\right\}}-\frac{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{\widehat{z}}_{ij}=a\right\}{X}_{ij}}{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{z}_{ij}^0=a\right\}}\right|\\ {}&\quad\quad +\underset{j,a}{\max}\kern0.1em \left|\frac{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{\widehat{z}}_{ij}=a\right\}{X}_{ij}}{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{z}_{ij}^0=a\right\}}-\frac{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{z}_{ij}^0=a\right\}{X}_{ij}}{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{z}_{ij}^0=a\right\}}\right|\\ {}&\quad\quad +\underset{j,a}{\max}\kern0.1em \left|\frac{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{z}_{ij}^0=a\right\}{X}_{ij}}{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{z}_{ij}^0=a\right\}}-\frac{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{z}_{ij}^0=a\right\}{P}_{ij}}{\sum \limits_{i=1}^I\kern0.20em \mathcal{I}\left\{{z}_{ij}^0=a\right\}}\right|\\ {}&\quad\equiv {\mathcal{I}}_1+{\mathcal{I}}_2+{\mathcal{I}}_3.\end{align*}$$

We now analyze these three terms separately. For the first term,

$$\begin{align*}\displaystyle {\mathcal{I}}_1& \le \underset{j,a}{\max}\kern0.1em \left(\sum \limits_i\kern0.20em \mathcal{I}\left\{{\widehat{z}}_{ij}=a\right\}{X}_{ij}\right)\cdotp \frac{\sum \limits_i\kern0.20em \left|\mathcal{I}\left\{{\widehat{z}}_{ij}=a\right\}-\mathcal{I}\left\{{z}_{ij}^0=a\right\}\right|}{i_{ja}^0\sum \limits_i\kern0.20em \mathcal{I}\left\{{\widehat{z}}_{ij}=a\right\}}\\ {}& \le \underset{j,a}{\max}\kern0.1em \frac{\sum \limits_i\kern0.20em \left|\mathcal{I}\left\{{\widehat{z}}_{ij}=a\right\}-\mathcal{I}\left\{{z}_{ij}^0=a\right\}\right|}{i_{ja}^0}\\ {}& \le \frac{1}{\unicode{x3b5} I}\sum \limits_i\kern0.20em \mathcal{I}\left\{{\boldsymbol{\unicode{x3b1}}}_i^0\ne {\widehat{\boldsymbol{\unicode{x3b1}}}}_i\right\}={o}_p\left(\frac{\unicode{x3b3}_J}{\unicode{x3b4}_J}\right).\end{align*}$$

The last inequality holds since, for all $j\in \left[J\right]$ and $a\in \left[{L}_j\right]$ , ${\sum}_i\kern0.1em \left|\mathcal{I}\left\{{\widehat{z}}_{ij}=a\right\}-\mathcal{I}\left\{{z}_{ij}^0=a\right\}\right|\le {\sum}_i\kern0.1em \mathcal{I}\left\{{\boldsymbol{\unicode{x3b1}}}_i^0\ne {\widehat{\boldsymbol{\unicode{x3b1}}}}_i\right\}$ . For the second term, we have

$$\begin{align*}{\mathcal{I}}_2=\underset{j,a}{\max}\kern0.1em \frac{\sum \limits_i\kern0.20em \left|{X}_{ij}\left(\mathcal{I}\left\{{\widehat{z}}_{ij}=a\right\}-\mathcal{I}\left\{{z}_{ij}^0=a\right\}\right)\right|}{i_{ja}^0}\le \underset{j,a}{\max}\kern0.1em \frac{\sum \limits_i\kern0.20em \left|\mathcal{I}\left\{{\widehat{z}}_{ij}=a\right\}-\mathcal{I}\left\{{z}_{ij}^0=a\right\}\right|}{i_{ja}^0}.\end{align*}$$

By the same argument as for ${\mathcal{I}}_1$ , we conclude that ${\mathcal{I}}_2={o}_p\left({\unicode{x3b3}}_J/{\unicode{x3b4}}_{\mathrm{J}}\right)$ ; thus, ${\mathcal{I}}_1+{\mathcal{I}}_2={o}_p\left({\unicode{x3b3}}_J/{\unicode{x3b4}}_J\right)$ . For the third term, we apply Hoeffding's inequality for bounded random variables and obtain

$$\begin{align*}P\left(\left|\frac{\sum \limits_i\kern0.20em \mathcal{I}\left\{{z}_{ij}^0=a\right\}\left({X}_{ij}-{P}_{ij}\right)}{i_{ja}^0}\right|\ge t\right)\le 2\exp \left(-2{i}_{ja}^0{t}^2\right)\le 2\exp \left(-2\unicode{x3b5} I{t}^2\right).\end{align*}$$

Noting that the number of $\left(j,a\right)$ pairs is at most $J{2}^K$ under Assumption 2, we have, for any $t>0$ ,

(A.23) $$\begin{align}P\left({\mathcal{I}}_3\ge t\right)\le J{2}^{K+1}\exp \left(-2\unicode{x3b5} I{t}^2\right).\end{align}$$
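To make the convergence of this bound explicit, write $\sqrt{J}\le C{I}^{1-c}$ for some constant $C>0$ (a constant introduced only for this display); then, with $t=1/\sqrt{I^{1-\tilde{c}}}$ ,

$$\begin{align*}J{2}^{K+1}\exp \left(-2\unicode{x3b5} I{t}^2\right)\le {2}^{K+1}{C}^2{I}^{2\left(1-c\right)}\exp \left(-2\unicode{x3b5} {I}^{\tilde{c}}\right)\to 0,\end{align*}$$

because the exponential factor decays faster than any polynomial in $I$ .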

Notably, ${2}^{K+1}$ remains a constant as $K$ is fixed. As shown in the preceding display, by choosing $t=1/\sqrt{I^{1-\tilde{c}}}$ for a small $\tilde{c}>0$ , the tail probability in (A.23) converges to zero when the scaling condition $\sqrt{J}=O\left({I}^{1-c}\right)$ holds. This implies that ${\mathcal{I}}_3={o}_p\left(1/\sqrt{I^{1-\tilde{c}}}\right)$ . Bringing together the preceding results, we have

$$\begin{align*}\underset{j,a}{\max}\kern0.1em \left|{\tilde{\unicode{x3b8}}}_{ja}-{\unicode{x3b8}}_{ja}^0\right|={o}_p\left(\frac{\unicode{x3b3}_J}{\unicode{x3b4}_J}\right)+{o}_p\left(\frac{1}{\sqrt{I^{1-\tilde{c}}}}\right).\end{align*}$$

Note that for any $\left(j,a\right)$ , we have ${\left({\tilde{\unicode{x3b8}}}_{ja}-{\widehat{\unicode{x3b8}}}_{ja}\right)}^2\le \tilde{C}{i}_{ja}^{-1}$ and ${i}_{ja}\ge I{\unicode{x3b5}}_2$ with probability approaching 1; thus ${\max}_{j,a}\kern0.1em \left|{\tilde{\unicode{x3b8}}}_{ja}-{\widehat{\unicode{x3b8}}}_{ja}\right|={O}_p\left({I}^{-1/2}\right)={o}_p\left(1/\sqrt{I^{1-\tilde{c}}}\right)$ . Therefore,

$$\begin{align*}\underset{j,a}{\max}\kern0.1em \left|{\widehat{\unicode{x3b8}}}_{ja}-{\unicode{x3b8}}_{ja}^0\right|\le \underset{j,a}{\max}\kern0.1em \left|{\tilde{\unicode{x3b8}}}_{ja}-{\unicode{x3b8}}_{ja}^0\right|+\underset{j,a}{\max}\kern0.1em \left|{\tilde{\unicode{x3b8}}}_{ja}-{\hat{\unicode{x3b8}}}_{ja}\right|={o}_p\left(\frac{\unicode{x3b3}_J}{\unicode{x3b4}_J}\right)+{o}_p\left(\frac{1}{\sqrt{I^{1-\tilde{c}}}}\right),\end{align*}$$

which completes the proof of Theorem 2.

Figures and tables

Table 1 The four-attribute $\mathrm{Q}$-matrix
Table 2 The five-attribute $\mathrm{Q}$-matrix
Table 3 The average correlations of attribute mastery probabilities estimated by first and second halves of MCMC iterations after the burn-in period
Figure 1 Simulation results of the DINA data generation with four-attribute $\mathbf{Q}$-matrix conditions.
Figure 2 Simulation results of the DINA data generation with five-attribute $\mathbf{Q}$-matrix conditions.
Figure 3 Simulation results of the general DCM data generation with four-attribute Q-matrix conditions.
Figure 4 Simulation results of the general DCM data generation with five-attribute Q-matrix conditions.
Figure 5 Box plots of attribute mastery probabilities of the DINA data generation with four-attribute Q-matrix conditions.
Figure 6 Box plots of attribute mastery probabilities of the DINA data generation with five-attribute Q-matrix conditions.
Figure 7 Box plots of attribute mastery probabilities of the general data generation with four-attribute $\mathbf{Q}$-matrix conditions.
Figure 8 Box plots of attribute mastery probabilities of the general data generation with five-attribute Q-matrix conditions.
Table 4 The $\mathrm{Q}$-matrix of ECPE data
Table 5 Means and SDs of posterior attribute mastery probabilities for GBGNPC and GBNPC methods
Table 6 Frequencies and ratios of the estimated attribute mastery patterns with the four estimation methods
Table 7 Contingency table of the estimated attribute mastery patterns by GBGNPC and GNPC
Table 8 Contingency table of the estimated attribute mastery patterns by GBNPC and NPC
Table 9 Individual differences in estimated patterns for GBGNPC and GNPC methods, response patterns, sum- and subscores, and attribute mastery probabilities
Table 10 Generalized posterior of attribute mastery pattern by GBNPC and NPC