5 Modeling

To provide feedback about student performance, the Dynamic Learning Maps® (DLM®) Alternate Assessment System draws on a well-established research base in cognition and learning theory and uses operational psychometric methods that remain relatively uncommon in large-scale assessment to provide feedback about student mastery of skills. This chapter describes the psychometric model that underlies the DLM System and the process used to estimate item and student parameters from student assessment data.

5.1 Psychometric Background

Learning maps, which are networks of sequenced learning targets, are at the core of the DLM assessments in English language arts and mathematics. In general, a learning map is a collection of skills to be mastered that are linked together by connections between the skills. The connections between skills indicate what should be mastered prior to learning additional skills. Together, the skills and their prerequisite connections map out the progression of learning within a given subject. Stated in the vocabulary of traditional psychometric methods, a learning map defines a large set of discrete latent variables indicating students’ learning status on key skills and concepts relevant to a large content domain, as well as a series of pathways indicating which topics (represented by latent variables) are prerequisites for learning other topics.

Because of the underlying map structure and the goal of providing fine-grained information beyond a single raw or scale score value, student results are reported as a profile of skill mastery. This profile is created using diagnostic classification modeling, which draws on research in cognition and learning theory to provide information about student mastery of multiple skills measured by the assessment. Diagnostic classification models (DCMs) are confirmatory latent class models that characterize the relationship of observed responses to a set of categorical latent variables (e.g., Bradshaw, 2016; Rupp et al., 2010). DCMs are also known as cognitive diagnosis models (e.g., Leighton & Gierl, 2007) or multiple classification latent class models (Maris, 1999) and are mathematically equivalent to Bayesian networks (e.g., Almond et al., 2015; Mislevy & Gitomer, 1995; Pearl, 1988). Whereas more traditional psychometric models, such as item response theory, model a single continuous latent variable, DCMs provide information about student mastery on multiple latent variables, or skills of interest.

DCMs have primarily been used in educational measurement settings in which more detailed information about test-takers’ skills is of interest, such as in assessing individual mathematics skills (e.g., Bradshaw et al., 2014), different levels of reading complexity (e.g., Templin & Bradshaw, 2014), and the temporal acquisition of science skills (e.g., Templin & Henson, 2008). To provide detailed profiles of student mastery of the skills, or attributes, measured by the assessment, DCMs require the specification of an item-by-attribute Q-matrix, indicating the attributes measured by each item. In general, for a given item, \(i\), the Q-matrix vector would be represented as \(q_i=[q_{i1},q_{i2},…,q_{iA}]\), where \(A\) is the total number of attributes. Similar to a factor pattern matrix in a confirmatory factor model, Q-matrix indicators are binary: either the item measures an attribute (\(q_{ia}=1\)) or it does not (\(q_{ia}=0\)).
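To make the Q-matrix concrete, the sketch below builds a small, hypothetical five-item, two-attribute Q-matrix (not drawn from the DLM assessments), along with the single-attribute column of ones used for a DLM linkage level (see section 5.3.1):

```python
# A minimal sketch of an item-by-attribute Q-matrix for a hypothetical
# five-item test measuring A = 2 attributes. Rows are items; columns are
# attributes; q[i, a] = 1 means item i measures attribute a.
import numpy as np

Q = np.array([
    [1, 0],  # item 1 measures attribute 1 only
    [1, 0],  # item 2 measures attribute 1 only
    [0, 1],  # item 3 measures attribute 2 only
    [1, 1],  # item 4 measures both attributes
    [0, 1],  # item 5 measures attribute 2 only
])

# For a DLM linkage level, every item measures the single attribute being
# calibrated, so the Q-matrix reduces to a column of ones:
Q_linkage_level = np.ones((5, 1), dtype=int)
```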

For each item, there is a set of conditional item-response probabilities that corresponds to the student’s possible mastery patterns. Although DCMs can be defined using any number of latent categories for each attribute, it is most common to use binary attributes, which provide more interpretable results to stakeholders (Bradshaw & Levy, 2019). When an item measures a single binary attribute, only two statuses are possible for any examinee: a master of the attribute or a nonmaster of the attribute.

In general, the modeling approach involves specifying the Q-matrix, determining the probability of being classified into each category of mastery (master or nonmaster), and relating those probabilities to students’ response data to determine a posterior probability of being classified as a master or nonmaster for each attribute. For DLM assessments, the attributes for which probabilities of mastery are calculated are the Essential Element (EE) linkage levels.

5.2 Essential Elements and Linkage Levels

Because the primary goal of the DLM assessments is to measure what students with the most significant cognitive disabilities know and can do, alternate grade-level expectations called EEs were created to provide students in the population access to the general education grade-level academic content. See Chapter 2 of this manual for a complete description. Each EE has an associated set of linkage levels that are ordered by increasing complexity. There are five linkage levels for each EE in English language arts and mathematics: Initial Precursor, Distal Precursor, Proximal Precursor, Target, and Successor.

5.3 Overview of the DLM Modeling Approach

Many statistical models are available for estimating the probability of attribute mastery in a DCM. The statistical model used to determine the probability of mastery for each linkage level of the DLM assessments is the log-linear cognitive diagnosis model (LCDM), a DCM that provides a general statistical framework for obtaining probabilities of class membership for each measured attribute (Henson et al., 2009). Student mastery statuses for each linkage level are obtained from a Bayesian estimation procedure and together form an overall profile of mastery.

5.3.1 Model Specification

Each linkage level was calibrated separately for each EE using its own LCDM. Linkage levels within an EE are estimated separately because, under the administration design, overlapping data from students taking testlets at multiple levels within an EE are uncommon. Also, because items were developed to meet a precise cognitive specification, all master and nonmaster probability parameters for items measuring a linkage level were assumed to be equal. That is, all items were assumed to be fungible, or exchangeable, within a linkage level. As such, each class (i.e., masters or nonmasters) has a single probability of responding correctly to all items measuring the linkage level, as depicted in Table 5.1; for each item measuring the same linkage level, the probability of providing a correct response is held constant for all students in a given mastery class. Chapter 3 of this manual details the item review procedures intended to support the fungibility assumption, and section 5.4.1 of this chapter describes empirical evidence supporting this constraint.

Table 5.1: Depiction of Fungible Item Parameters for Items Measuring a Single Linkage Level
| Item | Class 1 (Nonmasters) | Class 2 (Masters) |
|------|----------------------|-------------------|
| 1    | \(\pi_1\)            | \(\pi_2\)         |
| 2    | \(\pi_1\)            | \(\pi_2\)         |
| 3    | \(\pi_1\)            | \(\pi_2\)         |
| 4    | \(\pi_1\)            | \(\pi_2\)         |
| 5    | \(\pi_1\)            | \(\pi_2\)         |
Note. \(\pi\) represents the probability of providing a correct response.

The DLM scoring model for the 2021–2022 administration was as follows. Each linkage level within each EE was considered the latent variable to be measured (the attribute). Using DCMs, a probability of mastery on a scale of 0 to 1 was calculated for each linkage level within each EE. Students were then classified into one of two classes for each linkage level of each EE: either master or nonmaster. As described in Chapter 6 and Chapter 7 of this manual, a posterior probability of at least .8 was required for mastery classification.

The general form of DCMs is shown in Equation (5.1). In Equation (5.1), \(\pi_{ic}\) is the conditional probability of a student in class \(c\) providing a correct response to item \(i\), and \(x_{ij}\) is the observed response (i.e., 0 or 1) of student \(j\) to item \(i\). Thus, \(\pi_{ic}^{x_{ij}}(1-\pi_{ic})^{1-x_{ij}}\) represents the probability of a respondent in class \(c\) providing the observed response to item \(i\). Finally, \(\nu_c\) represents the base rate probability that any given respondent belongs to class \(c\).

\[\begin{equation} P(X_j=x_j) = \sum_{c=1}^C\nu_c\prod_{i=1}^I\pi_{ic}^{x_{ij}}(1-\pi_{ic})^{1-x_{ij}} \tag{5.1} \end{equation}\]
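As a concrete illustration of Equation (5.1), the sketch below computes the marginal probability of an observed response pattern for a hypothetical two-class, three-item example; all numeric values are invented for illustration.

```python
import numpy as np

def response_pattern_probability(x, pi, nu):
    """Marginal probability of response pattern x under Equation (5.1).

    x:  binary responses to I items, shape (I,)
    pi: conditional correct-response probabilities, shape (I, C),
        where pi[i, c] = P(X_i = 1 | class c)
    nu: base rate of membership in each class, shape (C,)
    """
    # Likelihood of the observed pattern within each class:
    # prod_i pi_ic^x_i * (1 - pi_ic)^(1 - x_i)
    class_likelihoods = np.prod(
        pi ** x[:, None] * (1 - pi) ** (1 - x[:, None]), axis=0
    )
    # Marginalize over classes, weighting by the base rates nu_c
    return np.sum(nu * class_likelihoods)

# Hypothetical two-class example: 3 items, nonmaster/master probabilities
pi = np.array([[0.20, 0.80], [0.30, 0.90], [0.25, 0.85]])
nu = np.array([0.4, 0.6])
print(response_pattern_probability(np.array([1, 0, 1]), pi, nu))
```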

Different types of DCMs use different measurement models to define \(\pi_{ic}\) in Equation (5.1). For DLM assessments, item responses are modeled using the LCDM, as described by Henson et al. (2009). The LCDM defines the conditional probabilities using a generalized linear model with a logit link function. Specifically, using the LCDM, \(\pi_{ic}\) is defined as seen in Equation (5.2), where \(\alpha_c\) is a binary indicator of mastery status for a student in class \(c\) for that attribute.

\[\begin{equation} \pi_{ic}=P(X_{ic}=1|\alpha_c) = \frac{\exp(\lambda_{i,0} + \lambda_{i,1}\alpha_c)}{1 + \exp(\lambda_{i,0} + \lambda_{i,1}\alpha_c)} \tag{5.2} \end{equation}\]

Equation (5.2) uses the LCDM notation described by Rupp et al. (2010), in which the \(\lambda\) subscripts follow the structure “item, effect,” where the effect is defined as 0 = intercept and 1 = main effect. All items in a linkage level were assumed to measure that linkage level, meaning the Q-matrix for the linkage level was a column of ones. As such, each item measured one latent variable, resulting in two parameters per item: (a) an intercept (\(\lambda_{i,0}\)) that corresponds to the probability of answering the item correctly for examinees who have not mastered the linkage level and (b) a main effect (\(\lambda_{i,1}\)) that corresponds to the increase in the probability of answering the item correctly for examinees who have mastered the linkage level. Because students who have mastered the linkage level should have a higher probability of providing a correct response than students who have not, \(\lambda_{i,1}\) is constrained to be positive to ensure monotonicity (Henson et al., 2009). Per the assumption of item fungibility, a single set of probabilities was estimated for all items within a linkage level. Therefore, Equation (5.2) can be simplified to remove the item-level \(\lambda\) parameters. Equation (5.3) removes the item-level effects, showing that for each linkage level only one intercept shared by all items (\(\lambda_0\)) and one main effect shared by all items (\(\lambda_1\)) are estimated.

\[\begin{equation} \pi_{ic}=P(X_{ic}=1|\alpha_c) = \frac{\exp(\lambda_{0} + \lambda_{1}\alpha_c)}{1 + \exp(\lambda_{0} + \lambda_{1}\alpha_c)} \tag{5.3} \end{equation}\]
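For example, the following minimal sketch evaluates Equation (5.3) for a hypothetical pair of linkage-level parameters:

```python
import numpy as np

def lcdm_probabilities(lambda_0, lambda_1):
    """Conditional correct-response probabilities under Equation (5.3)."""
    logits = np.array([lambda_0, lambda_0 + lambda_1])  # alpha_c = 0, 1
    probs = 1 / (1 + np.exp(-logits))                   # inverse logit
    return {"nonmaster": probs[0], "master": probs[1]}

# Hypothetical fungible parameters for one linkage level
print(lcdm_probabilities(lambda_0=-1.4, lambda_1=2.6))
# {'nonmaster': ~0.20, 'master': ~0.77}
```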

Finally, because each linkage level is estimated separately as a single attribute LCDM, there are only two possible mastery classes (i.e., nonmasters and masters). Therefore, only a single structural parameter was needed (\(\nu\)), which is the probability that a randomly selected student who is assessed on the linkage level is a master (i.e., the analogous map parameter). The base rate of the other class (i.e., nonmastery) is deterministically calculated as \(1 - \nu\). In total, three parameters per linkage level are specified in the DLM scoring model: a fungible intercept, a fungible main effect, and the proportion of masters.

5.3.2 Model Calibration

A Bayesian approach was used to calibrate the DCMs. A Bayesian approach was preferred over a simpler maximum likelihood approach because the posterior distributions derived from Bayesian methods offer more robust methods for evaluating model fit (see section 5.4.1). We specifically selected an empirical Bayes procedure for several reasons. In any Bayesian approach, prior distributions must be specified for each parameter in the model. An empirical Bayes procedure uses the data to estimate the prior distributions, whereas a standard Bayesian procedure would fix the prior distributions a priori (Carlin & Louis, 2001). The empirical priors offer several advantages. First, because of the number of models estimated (i.e., 1,275 linkage levels), fixing prior distributions a priori would require using the same priors for every model. Empirical priors allow us to select prior distributions specific to each linkage level, rather than a single general prior that may be more or less appropriate for any given linkage level, thus increasing the information available in the estimation process (Nabi et al., 2022). Second, fixing priors a priori requires many elicitation decisions from the practitioner, and different decisions lead to different prior distributions, which in turn affect the resulting posterior distributions (Falconer et al., 2022; Stefan et al., 2020). Using empirical priors removes these practitioner degrees of freedom. Finally, empirical prior distributions are often more informative than a general prior fixed a priori. More informative priors make the estimation more efficient by focusing the sampling on the highest-density region of the posterior distribution without biasing the final parameter estimates (Petrone et al., 2014).

Across all grades and subjects, there were 255 EEs, each with five linkage levels, resulting in a total of 255 \(\times\) 5 \(=\) 1,275 separate calibration models. Each calibration included all operational items for the EE and linkage level. Each model was estimated using a two-step empirical Bayes procedure (Casella, 1985; Efron, 2014) with the software package rstan (Stan Development Team, 2022), an interface to the Stan probabilistic programming language (Carpenter et al., 2017). The first step consisted of fitting a set of bootstrapped models with an optimization algorithm to estimate the mean and variability of each parameter. The second step used those summaries to define the prior distributions in a fully Bayesian estimation of the model using a Markov Chain Monte Carlo procedure. Each step is described in detail in the following sections.

5.3.2.1 Step 1: Estimation of Bootstrapped Models

In the first step, the data for each attribute were bootstrap resampled 100 times (Babu, 2011). For each bootstrap resample, the LCDM was fit using the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) optimization algorithm (Liu & Nocedal, 1989; Nocedal & Wright, 2006). L-BFGS is a widely used maximum likelihood optimization algorithm that can efficiently estimate many types of models, including the LCDM. After the LCDM was estimated on each of the bootstrapped samples, there were 100 estimates of each of the three model parameters. We denote these parameters as:

\[\begin{align*} \lambda_0^* &= [\lambda_{0_1}, \lambda_{0_2}, \lambda_{0_3}, ..., \lambda_{0_{100}}] \\ \lambda_1^* &= [\lambda_{1_1}, \lambda_{1_2}, \lambda_{1_3}, ..., \lambda_{1_{100}}] \\ \nu^* &= [\nu_1, \nu_2, \nu_3, ..., \nu_{100}] \end{align*}\]

For each parameter, we calculated the mean value and standard deviation across the 100 bootstrap samples. These values were then used to define the prior distributions in the second step.
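As an illustration of Step 1, the sketch below fits the three-parameter fungible LCDM by maximum likelihood with SciPy's L-BFGS-B implementation and summarizes the bootstrap estimates. The function names and starting values are ours, and the sketch simplifies the operational procedure by assuming every student responds to the same items.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # inverse logit

def neg_log_likelihood(params, responses):
    """Negative marginal log-likelihood of the fungible LCDM (Eqs. 5.1, 5.3)."""
    lambda_0, lambda_1, logit_nu = params
    pi = expit(np.array([lambda_0, lambda_0 + lambda_1]))  # (nonmaster, master)
    nu = expit(logit_nu)  # estimate nu on the logit scale to keep it in (0, 1)
    # Under fungibility, a student's likelihood depends only on the raw score
    k = responses.sum(axis=1)
    n_items = responses.shape[1]
    lik = ((1 - nu) * pi[0] ** k * (1 - pi[0]) ** (n_items - k)
           + nu * pi[1] ** k * (1 - pi[1]) ** (n_items - k))
    return -np.sum(np.log(lik))

def fit_lcdm(responses):
    """Fit the three-parameter fungible LCDM with the L-BFGS-B optimizer."""
    result = minimize(
        neg_log_likelihood,
        x0=np.array([-1.0, 2.0, 0.0]),
        args=(responses,),
        method="L-BFGS-B",
        bounds=[(None, None), (0.0, None), (None, None)],  # lambda_1 >= 0
    )
    lambda_0, lambda_1, logit_nu = result.x
    return np.array([lambda_0, lambda_1, expit(logit_nu)])

def bootstrap_parameter_summaries(responses, n_boot=100, seed=2022):
    """Step 1: resample students, refit the model, summarize each parameter."""
    rng = np.random.default_rng(seed)
    n_students = responses.shape[0]
    draws = np.stack([
        fit_lcdm(responses[rng.integers(0, n_students, n_students)])
        for _ in range(n_boot)
    ])
    # The means and standard deviations define the Step 2 prior distributions
    return draws.mean(axis=0), draws.std(axis=0, ddof=1)
```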

5.3.2.2 Step 2: Estimation of Final Bayesian Model

In the second step, the full data set was used to estimate the LCDM for each linkage level using Markov Chain Monte Carlo and the Hamiltonian Monte Carlo algorithm (Betancourt, 2018; Neal, 2011). The prior distribution for each parameter is defined using the values from the first step. The prior for the intercept (\(\lambda_0\)) is defined as a normal distribution with a mean and standard deviation equal to the corresponding values from the first step. We define the mean and standard deviation of \(\lambda_0^*\) as \(\mu_{\lambda_0^*}\) and \(\sigma_{\lambda_0^*}\), respectively. The prior for the intercept in the LCDM is then given as:

\[\begin{equation} \lambda_0 \sim \mathcal{N}(\mu_{\lambda_0^*},\ \sigma_{\lambda_0^*}) \tag{5.4} \end{equation}\]

Similarly, the main effect parameter (\(\lambda_1\)) is also defined with a normal distribution. However, the prior distribution of the main effect is truncated at 0, forcing the main effect to be positive to ensure monotonicity in the LCDM.

\[\begin{equation} \lambda_1 \sim \begin{cases} 0, & \text{if}\ \lambda_1 \leq 0 \\ \mathcal{N}(\mu_{\lambda_1^*},\ \sigma_{\lambda_1^*}), & \text{otherwise} \end{cases} \tag{5.5} \end{equation}\]

Finally, because the base rate of linkage level class membership (\(\nu\)) is a probability that must be between 0 and 1, a beta distribution is used for the prior. The beta distribution is governed by two shape parameters, \(\alpha\) and \(\beta\). Given these two shape parameters, the mean of the beta distribution is given by:

\[\begin{equation} \mu = \frac{\alpha}{\alpha + \beta} \tag{5.6} \end{equation}\]

and the variance as:

\[\begin{equation} \sigma^2 = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)} \tag{5.7} \end{equation}\]

With some algebra, we calculate the values of the shape parameters given a mean (\(\mu_{\nu^*}\)) and standard deviation (\(\sigma_{\nu^*}\)):

\[\begin{align} \alpha_{\nu^*} &= \bigg(\frac{1 - \mu_{\nu^*}}{\sigma_{\nu^*}^2} - \frac{1}{\mu_{\nu^*}}\bigg)\mu_{\nu^*}^2 \\ \notag \\ \beta_{\nu^*} &= \alpha_{\nu^*}\bigg(\frac{1}{\mu_{\nu^*}} - 1\bigg) \end{align}\]

The prior distribution for \(\nu\) is then defined as:

\[\begin{equation} \nu \sim \mathcal{B}(\alpha_{\nu^*},\ \beta_{\nu^*}) \tag{5.8} \end{equation}\]
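For example, a bootstrap mean of .65 and standard deviation of .05 (hypothetical values) convert to beta shape parameters as follows:

```python
def beta_shape_parameters(mu, sigma):
    """Solve Equations (5.6) and (5.7) for the beta shape parameters.

    Valid whenever sigma**2 < mu * (1 - mu), which holds for the
    bootstrap summaries of a probability parameter in practice.
    """
    alpha = ((1 - mu) / sigma ** 2 - 1 / mu) * mu ** 2
    beta = alpha * (1 / mu - 1)
    return alpha, beta

# Hypothetical bootstrap summary for nu: mean .65, standard deviation .05
alpha, beta = beta_shape_parameters(0.65, 0.05)
print(alpha, beta)  # 58.5, 31.5 -> Beta(58.5, 31.5) has mean .65 and SD .05
```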

After the prior distributions were defined, the LCDM was estimated. To ensure the posterior was adequately explored, four chains were estimated. For each chain, we specified 2,000 warm-up iterations and retained 1,000 post-warm-up iterations, resulting in a posterior distribution of 4,000 draws (i.e., 1,000 from each of the four chains). After estimation, we ensured that the model had converged and adequately explored the posterior space by evaluating the \(\widehat{R}\) and effective sample size metrics described by Vehtari et al. (2021). Using their recommended cutoffs, we ensured that all \(\widehat{R}\) values were below 1.01 and that all effective sample sizes were greater than 400. After model evaluation, the mean of the posterior distribution was calculated for each of the three model parameters. These parameter estimates were then used for scoring the DLM assessments.
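As a sketch of these checks, assuming a parameter's posterior draws are arranged as a (chains \(\times\) iterations) array, the diagnostics can be computed with the ArviZ library:

```python
import arviz as az
import numpy as np

# Placeholder draws standing in for one parameter's posterior:
# 4 chains x 1,000 retained iterations, as in the operational model
draws = np.random.default_rng(1).normal(size=(4, 1000))

rhat = float(az.rhat(draws)["x"])  # rank-normalized R-hat (Vehtari et al.)
ess = float(az.ess(draws)["x"])    # bulk effective sample size

# Apply the cutoffs used for the DLM calibration models
assert rhat < 1.01 and ess > 400, "flag the linkage level for re-inspection"
```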

5.3.3 Estimation of Student Mastery Probabilities

Once the LCDM parameters have been calibrated, student mastery probabilities are obtained for each assessed linkage level. For DLM scoring, student mastery probabilities are expected a posteriori (EAP) estimates, the method most commonly used for scoring in scale score settings (e.g., item response theory). For a thorough discussion of EAP estimates in scale score and diagnostic settings, see Chapter 10 of Rupp et al. (2010). For each student \(j\) and linkage level \(l\), the EAP estimate of the mastery probability, \(\hat{\alpha}_{jl}\), is obtained using the following formula:

\[\begin{equation} \hat{\alpha}_{jl} = \frac{\prod_{i=1}^{I_j}\big[\pi_{i1}^{X_{ji}}(1 - \pi_{i1})^{(1 - X_{ji})}\big]\nu_1}{\sum_{c = 0}^1\prod_{i=1}^{I_j}\big[\pi_{ic}^{X_{ji}}(1 - \pi_{ic})^{(1 - X_{ji})}\big]\nu_c} \tag{5.9} \end{equation}\]

In Equation (5.9), \(X_{ji}\) is the dichotomous response of student \(j\) to item \(i\), and \(\pi_{ic}\) is the model-based probability of answering item \(i\) correctly, conditional on student \(j\) having mastery status \(c\) for the linkage level, as defined in Equation (5.3). The mastery status can take two values: masters (\(c\) = 1) and nonmasters (\(c\) = 0). Finally, \(\nu_c\) is the base rate probability of membership in each mastery status (see Equation (5.1)). Thus, the numerator represents the likelihood of the student being in class \(c = 1\) (i.e., the master class), and the denominator is the total likelihood across both classes. The EAP estimate is then the proportion of the total likelihood that comes from the master class.
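The sketch below implements Equation (5.9) for a single linkage level, using hypothetical fungible item parameters:

```python
import numpy as np

def eap_mastery_probability(x, pi, nu):
    """Posterior probability of mastery under Equation (5.9).

    x:  the student's binary responses to the items taken, shape (I,)
    pi: conditional correct-response probabilities, shape (I, 2),
        column 0 = nonmasters, column 1 = masters
    nu: base rate of mastery, P(alpha = 1)
    """
    # Likelihood of the response pattern within each class
    lik = np.prod(pi ** x[:, None] * (1 - pi) ** (1 - x[:, None]), axis=0)
    weighted = lik * np.array([1 - nu, nu])  # nu_0 = 1 - nu, nu_1 = nu
    # Proportion of the total likelihood coming from the master class
    return weighted[1] / weighted.sum()

# Hypothetical fungible linkage level: 5 items, pi = (.25, .80), nu = .6
pi = np.tile([0.25, 0.80], (5, 1))
print(eap_mastery_probability(np.array([1, 1, 1, 0, 1]), pi, nu=0.6))
# about .98
```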

5.4 Model Evaluation

There are many ways to evaluate DCMs. Ravand & Baghaei (2020) suggest four main areas for evaluation: (1) fit, (2) classification consistency and accuracy, (3) item discrimination, and (4) congruence of attribute difficulty with substantive expectations. Fit can be further broken down into different types of fit (e.g., model fit and item fit).

Many of these aspects are described in other sections of this manual and published research. Item fit is described with other measures of item quality in Chapter 3 of this manual, and classification consistency is discussed with other measures of consistency and reliability in Chapter 8 of this manual. The congruence of difficulty and expectations is discussed in Chapter 2 of this manual and the work of W. J. Thompson & Nash (2019) and W. J. Thompson & Nash (2022). Finally, item discrimination is described later in this chapter in section 5.5.3, in the context of estimated model parameters.

In this section, we focus on two aspects that are critical to inferences of student mastery: model fit and classification accuracy. Model fit has important implications for the validity of inferences that can be made from assessment results. If the model used to calibrate and score the assessment does not fit the data well, results from the assessment may not accurately reflect what students know and can do. Also called absolute model fit (e.g., Chen et al., 2013), this aspect involves an evaluation of the alignment between the three parameters estimated for each linkage level and the observed item responses. The second aspect is classification accuracy. This refers to how well the classifications represent the true underlying latent class. The accuracy of the assessment results (i.e., the classifications) is a prerequisite for any inferences that would be made from the results. Thus, the accuracy of the classifications is perhaps the most crucial aspect of model evaluation from a practical and operational standpoint. These aspects are discussed in the following sections.

5.4.1 Model Fit

Absolute model fit is evaluated through posterior predictive model checks, as described by W. J. Thompson (2019). Using parameter posterior distributions, we create a distribution for the expected number of students at each raw score point (i.e., the number of correct item responses). We then compare the observed number of students at each score point to the expected distribution using a \(\chi^2\)-like statistic. Finally, we can compare our \(\chi^2\)-like statistic to a distribution of what would be expected, given the expected distributions of students at each score point. This results in a posterior predictive p-value (ppp), which represents how extreme our observed statistic is compared to the model-implied expectation. Very low values indicate poor model fit, whereas very high values may indicate overfitting. For details on the calculation of this statistic, see W. J. Thompson (2019).
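The simplified sketch below illustrates the logic of this check for one linkage level, assuming every student takes the same number of items and that `posterior_draws` holds sampled \((\lambda_0, \lambda_1, \nu)\) triples; the operational implementation is described in W. J. Thompson (2019).

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(2022)

def replicate_score_counts(lambda_0, lambda_1, nu, n_students, n_items):
    """Simulate one replicated data set; tally students at each raw score."""
    mastery = rng.random(n_students) < nu
    p_correct = expit(lambda_0 + lambda_1 * mastery)
    scores = rng.binomial(n_items, p_correct)
    return np.bincount(scores, minlength=n_items + 1)

def chi_square_like(counts, expected):
    """Chi-square-like discrepancy between score counts and expectations."""
    expected = np.maximum(expected, 1e-6)  # guard against division by zero
    return np.sum((counts - expected) ** 2 / expected)

def ppp_value(posterior_draws, observed_counts, n_students, n_items):
    """Share of replicated discrepancies at least as large as the observed."""
    reps = np.stack([
        replicate_score_counts(l0, l1, nu, n_students, n_items)
        for l0, l1, nu in posterior_draws
    ])
    expected = reps.mean(axis=0)  # expected students at each raw score
    observed_stat = chi_square_like(observed_counts, expected)
    replicated_stats = np.array([chi_square_like(r, expected) for r in reps])
    return np.mean(replicated_stats >= observed_stat)
```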

Due to the large number of models being evaluated (i.e., 1,275 linkage levels), the ppp values were adjusted using the Holm correction, which is uniformly more powerful than the popular Bonferroni method (Holm, 1979). Linkage levels were flagged for misfit if the adjusted ppp value was less than .05. Table 5.2 shows the percentage of models with acceptable model fit by linkage level. Across all linkage levels, 994 (78%) of the estimated models showed acceptable model fit. Misfit was not evenly distributed across the linkage levels: the lower linkage levels were flagged at a higher rate than the higher linkage levels. This is likely due to the greater diversity of the student population at the lower linkage levels (e.g., required supports, expressive communication behaviors), which may affect item response behavior. To address the misfit flags, we are prioritizing test development for flagged linkage levels so that testlets contributing to misfit can be retired. For a description of item development practices, see Chapter 3 of this manual. We also plan to incorporate additional item quality statistics into the review of field test data to ensure that only items and testlets that conform to model expectations are promoted to the operational assessment. Overall, however, the fungible LCDMs appear to largely reflect the observed data. Additionally, model fit is evaluated on an annual basis and continues to improve over time as a result of adjustments to the pool of available content (i.e., improved item-writing practices and retirement of testlets contributing to misfit). Finally, it should be noted that a linkage level flagged for model misfit may still have high classification accuracy, indicating that student mastery classifications can be trusted even in the presence of misfit.
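As an illustration, the Holm adjustment and flagging rule can be applied with statsmodels (the ppp values below are hypothetical):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

ppp = np.array([0.001, 0.040, 0.200, 0.650, 0.910])  # hypothetical ppp values

# method="holm" applies the step-down Holm (1979) adjustment
reject, ppp_adjusted, _, _ = multipletests(ppp, alpha=0.05, method="holm")
flagged_for_misfit = ppp_adjusted < 0.05  # flag models with adjusted ppp < .05
```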

Table 5.2: Percentage of Models With Acceptable Model Fit (ppp > .05)

| Linkage Level      | English Language Arts (%) | Mathematics (%) |
|--------------------|---------------------------|-----------------|
| Initial Precursor  | 56.8                      | 25.2            |
| Distal Precursor   | 75.0                      | 79.4            |
| Proximal Precursor | 85.1                      | 78.5            |
| Target             | 93.2                      | 86.0            |
| Successor          | 96.6                      | 97.2            |

5.4.2 Classification Accuracy

The most practically important aspect of model fit for DCMs is classification accuracy. Classification accuracy is a measure of how accurate or uncertain classification decisions are for a given attribute in a DCM (the linkage level for DLM assessments). This measure of model fit is conceptualized by a summary of a 2 \(\times\) 2 contingency table of the true and model-estimated mastery statuses (Sinharay & Johnson, 2019). For a discussion of the closely related classification consistency, see Chapter 8 of this manual.

For an operational assessment, we do not know students’ true mastery status. However, we can still estimate the classification accuracy for each linkage level, as shown by Wang et al. (2015) and Johnson & Sinharay (2018) with

\[\begin{equation} \hat{P}_{A} = \frac{1}{N}\sum_{n=1}^N \tilde{\alpha}P(\alpha = 1| \mathbf{X} = \mathbf{x}_n) + \frac{1}{N}\sum_{n=1}^N (1 - \tilde{\alpha})P(\alpha=0|\mathbf{X} = \mathbf{x}_n). \tag{5.10} \end{equation}\]

In Equation (5.10), \(N\) is the total number of students, \(\tilde{\alpha}\) is the model-estimated mastery status, and \(P(\alpha = 1|\mathbf{X} = \mathbf{x}_n)\) is the model-estimated probability that the linkage level was mastered (or not mastered for \(\alpha = 0\)). Johnson & Sinharay (2018) recommended interpretive guidelines for the classification accuracy, \(\hat{P}_A\): .99 = Excellent, .95–.98 = Very Good, .89–.94 = Good, .83–.88 = Fair, .55–.82 = Poor, and <.55 = Weak.
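A minimal sketch of Equation (5.10) follows, assuming `posterior_prob` holds each student's posterior probability of mastery for a linkage level and using the .8 mastery threshold described in section 5.3.1:

```python
import numpy as np

def classification_accuracy(posterior_prob, threshold=0.8):
    """Estimate classification accuracy (Equation 5.10).

    posterior_prob: P(alpha = 1 | x_n) for each student, shape (N,)
    threshold:      posterior probability required to classify as a master
    """
    alpha_tilde = posterior_prob >= threshold  # model-estimated mastery status
    # Masters contribute P(alpha = 1 | x_n); nonmasters contribute P(alpha = 0 | x_n)
    return np.mean(np.where(alpha_tilde, posterior_prob, 1 - posterior_prob))

# Hypothetical posterior probabilities for ten students
probs = np.array([.95, .88, .92, .97, .10, .08, .85, .99, .15, .05])
print(classification_accuracy(probs))  # about .92 -> "Good"
```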

Across all estimated models, 954 linkage levels (75%) demonstrated at least fair classification accuracy. Table 5.3 shows the number and percentage of models within each linkage level that demonstrated each category of classification accuracy. Results are fairly consistent across linkage levels, with no one level showing systematically higher or lower accuracy. As was the case for model misfit, linkage levels flagged for low classification accuracy are prioritized for test development.

Table 5.3: Estimated Classification Accuracy by Linkage Level

| Linkage Level         | Weak (0.00–.55) | Poor (.55–.82) | Fair (.83–.88) | Good (.89–.94) | Very Good (.95–.98) | Excellent (.99–1.00) |
|-----------------------|-----------------|----------------|----------------|----------------|---------------------|----------------------|
| English Language Arts |                 |                |                |                |                     |                      |
| Initial Precursor     | 0 (0.0)         | 2 (1.4)        | 26 (17.6)      | 84 (56.8)      | 25 (16.9)           | 11 (7.4)             |
| Distal Precursor      | 0 (0.0)         | 35 (23.6)      | 56 (37.8)      | 30 (20.3)      | 15 (10.1)           | 12 (8.1)             |
| Proximal Precursor    | 0 (0.0)         | 63 (42.6)      | 49 (33.1)      | 25 (16.9)      | 6 (4.1)             | 5 (3.4)              |
| Target                | 0 (0.0)         | 51 (34.5)      | 42 (28.4)      | 35 (23.6)      | 13 (8.8)            | 7 (4.7)              |
| Successor             | 0 (0.0)         | 50 (33.8)      | 35 (23.6)      | 44 (29.7)      | 13 (8.8)            | 6 (4.1)              |
| Mathematics           |                 |                |                |                |                     |                      |
| Initial Precursor     | 0 (0.0)         | 1 (0.9)        | 14 (13.1)      | 76 (71.0)      | 16 (15.0)           | 0 (0.0)              |
| Distal Precursor      | 0 (0.0)         | 34 (31.8)      | 42 (39.3)      | 27 (25.2)      | 3 (2.8)             | 1 (0.9)              |
| Proximal Precursor    | 0 (0.0)         | 35 (32.7)      | 30 (28.0)      | 32 (29.9)      | 9 (8.4)             | 1 (0.9)              |
| Target                | 0 (0.0)         | 19 (17.8)      | 40 (37.4)      | 36 (33.6)      | 11 (10.3)           | 1 (0.9)              |
| Successor             | 0 (0.0)         | 31 (29.0)      | 35 (32.7)      | 31 (29.0)      | 10 (9.3)            | 0 (0.0)              |

Note. Cell values are n (%). Column headings show the range of \(\hat{P}_A\) for each category.

When looking at absolute model fit and classification accuracy in combination, linkage levels flagged for absolute model misfit often have high classification accuracy. Of the 281 linkage levels that were flagged for absolute model misfit, 259 (92%) showed fair or better classification accuracy. Thus, even when misfit is present, we can be confident in the accuracy of the mastery classifications. In total, 98% of linkage levels (n = 1,253) had acceptable absolute model fit and/or acceptable classification accuracy.

5.5 Calibrated Parameters

As described earlier in this chapter, the item parameters for the DLM diagnostic assessments are the conditional probability of nonmasters providing a correct response (i.e., the inverse logit of \(\lambda_0\)) and the conditional probability of masters providing a correct response (i.e., the inverse logit of \(\lambda_0 + \lambda_1\)). Because of the assumption of fungibility, a single set of item parameters is estimated for each of the 1,275 linkage levels across English language arts and mathematics. Across all linkage levels, the conditional probability of providing a correct response is generally expected to be high for masters and low for nonmasters. In addition to the item parameters, the psychometric model also includes a structural parameter (\(\nu\)), which defines the base rate of class membership for each linkage level. A summary of the operational parameters used to score the 2021–2022 assessment is provided in the following sections.
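For reference, the short sketch below (with hypothetical parameter values) shows how a linkage level's three calibrated parameters map onto the quantities summarized in sections 5.5.1 through 5.5.4:

```python
from scipy.special import expit  # inverse logit

# Hypothetical calibrated parameters for one linkage level
lambda_0, lambda_1, nu = -1.2, 2.8, 0.55

p_nonmaster = expit(lambda_0)            # section 5.5.2: about .23
p_master = expit(lambda_0 + lambda_1)    # section 5.5.1: about .83
discrimination = p_master - p_nonmaster  # section 5.5.3: about .60
base_rate = nu                           # section 5.5.4: base rate of mastery
```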

5.5.1 Probability of Masters Providing Correct Response

When items measuring each linkage level function as expected, students who have mastered the linkage level have a high probability of providing a correct response to items measuring the linkage level. Instances where masters have a low probability of providing correct responses may indicate that the linkage level does not measure what it is intended to measure, or that students who have mastered the content select a response other than the key. These instances may result in students who have mastered the content providing incorrect responses and being incorrectly classified as nonmasters. This outcome has implications for the validity of inferences that can be made from results, including educators using results to inform instructional planning in the subsequent year.

Using the 2021–2022 operational calibration, Figure 5.1 depicts the conditional probability of masters providing a correct response to items measuring each of the 1,275 linkage levels. Because the point of maximum uncertainty is .50 (i.e., a correct response is as likely as an incorrect response), masters should have a greater than 50% chance of providing a correct response. The results in Figure 5.1 demonstrate that the vast majority of linkage levels (n = 1,257; 99%) performed as expected. Additionally, 95% of linkage levels (n = 1,207) had a conditional probability of masters providing a correct response greater than .60. Only a few linkage levels (n = 5; <1%) had a conditional probability of masters providing a correct response less than .40; among these five, the Successor linkage level was the most prevalent (n = 3; 60%). Thus, the vast majority of linkage levels performed consistently with expectations for masters of the linkage levels.

Figure 5.1: Probability of Masters Providing a Correct Response to Items Measuring Each Linkage Level

5.5.2 Probability of Nonmasters Providing Correct Response

When items measuring each linkage level function as expected, nonmasters of the linkage level have a low probability of providing a correct response to items measuring the linkage level. Instances where nonmasters have a high probability of providing correct responses may indicate that the linkage level does not measure what it is intended to measure, or that the correct answers to items measuring the level are easily guessed. These instances may result in students who have not mastered the content providing correct responses and being incorrectly classified as masters. This outcome has implications for the validity of inferences that can be made from results and for educators using results to inform instructional planning in the subsequent year.

Figure 5.2 summarizes the probability of nonmasters providing correct responses to items measuring each of the 1,275 linkage levels. As Figure 5.2 shows, there is greater variation in this probability than was observed for masters. While the majority of linkage levels (n = 987; 77%) performed as expected, nonmasters sometimes had a greater than .50 chance of providing a correct response to items measuring the linkage level. Most linkage levels (n = 689; 54%) had a conditional probability of nonmasters providing a correct response less than .40, but 102 (8%) had a conditional probability greater than .60, meaning that for these linkage levels nonmasters were more likely than not to provide a correct response. This may indicate that the items (and the linkage level as a whole, since the item parameters are shared) were easily guessable or did not discriminate well between the two groups of students. Of these 102 linkage levels, the Successor linkage level was the most prevalent (n = 61; 60%).

Figure 5.2: Probability of Nonmasters Providing a Correct Response to Items Measuring Each Linkage Level

5.5.3 Item Discrimination

The discrimination of a linkage level represents how well the items are able to differentiate masters and nonmasters. For diagnostic models, this is assessed by comparing the conditional probabilities of masters and nonmasters providing a correct response. Linkage levels that are highly discriminating will have a large difference between the conditional probabilities, with a maximum value of 1.00 (i.e., masters have a 100% chance of providing a correct response and nonmasters a 0% chance). Figure 5.3 shows the distribution of linkage level discrimination values. Overall, 69% of linkage levels (n = 885) have a discrimination greater than .40, indicating a large difference between the conditional probabilities (e.g., .75 to .35, .90 to .50, etc.). However, there were 37 linkage levels (3%) with a discrimination of less than .10, indicating that masters and nonmasters tend to perform similarly on items measuring these linkage levels. Of these 37 linkage levels with a discrimination of less than .10, the Successor linkage level was the most prevalent, with 19 linkage levels (51%).

Figure 5.3: Difference Between Masters’ and Nonmasters’ Probability of Providing a Correct Response to Items Measuring Each Linkage Level

5.5.4 Base Rate Probability of Class Membership

The base rate of class membership is the DCM structural parameter and represents the estimated proportion of students in each class for each EE and linkage level. A base rate close to .50 indicates that students assessed on a given linkage level are, a priori, equally likely to be a master or nonmaster. Conversely, a high base rate would indicate that students testing on a linkage level are, a priori, more likely to be masters. Figure 5.4 depicts the distribution of the base rate probabilities. Overall, 69% of linkage levels (n = 879) had a base rate of mastery between .25 and .75. On the edges of the distribution, 121 linkage levels (9%) had a base rate of mastery less than .25, and 275 linkage levels (22%) had a base rate of mastery higher than .75. This indicates that students are more likely to be assessed on linkage levels they have mastered than those they have not mastered.

Figure 5.4: Base Rate of Linkage Level Mastery

5.6 Conclusion

In summary, the DLM modeling approach uses well-established research in Bayesian inference networks and diagnostic classification modeling to determine student mastery of the skills measured by the assessment. A DCM is estimated for each linkage level of each EE to determine the probability of student mastery. Items within a linkage level are assumed to be fungible, with equivalent item probability parameters for masters and nonmasters, owing to the conceptual approach used to construct DLM testlets. Analyses of the estimated models indicate acceptable levels of absolute model fit and classification accuracy, and the estimated parameters from each DCM are generally within the optimal ranges.