
7.1 Building a Foundation

The data we are considering here are usually a set of ordinal responses or ratings by a sample of persons to an instrument, which is a set of questions or test items. Typically, the items are carefully written and are appropriate for the targeted people. The data are collected and then statistical manipulation is performed to extract from those data a numerical scale for the intended latent variable. The extraction is considered successful if the statistical model employed is deemed to describe the data satisfactorily.

As a first step away from the specifics of this type of scaling, we need measurement models which not only surmount the idiosyncrasies of individual datasets, but also intentionally construct linear measures as independent as statistically possible of the specifics of the items and persons generating the data. Then departures in the data from these measurement models become failures in the data to support generalizable linear measurement, rather than failures in the statistical model to describe the characteristics of the data exactly. This is one crucial distinction between statistical models, which describe the data, and measurement models, which prescribe the data, and there are other distinctions [7].

Of course, empirical data always fall short of an ideal, so the criterion for acceptance of data points becomes their utility. For instance, suppose a student makes several lucky guesses on a multiple-choice arithmetic test. Social fairness may mandate that the student be credited with these successes and measured accordingly. However, this measurement may result in the student being placed in an advanced arithmetic class for which the student is not prepared. An analogous situation when measuring a child’s height would be that the child stood on tip-toe at the moment of measuring. The child’s extra height might allow the child to ride a roller-coaster on which the child cannot be restrained safely. However, if a student must be credited with lucky guesses, an immediate solution is to disregard the lucky guesses while constructing the measurement scale so that guessing does not distort the scale itself. Then, when measuring each student on this scale, lucky guesses are allowed to increase the students’ measures, but are flagged and brought to the attention of the student, teacher, parent, etc., utilizing tools like the “Kidmap” [5, 45], so that the responses’ status as exceptional and unexpected can inform instruction.

Models which meet our requirement for intentional linear measurement are the Rasch model [30] and its extensions [1, etc.]. These construct linear measures from a correspondence of ordinal data with explanatory theory, and also support the examination of individual data points (responses, ratings, etc.) for their ability to support linear measurement of a latent variable. Once the empirical capacity to obtain data of the needed quality is established, then the latent variable itself can be investigated as to whether it facilitates fit-for-purpose comparisons. However, a form of data-dependency remains in any local scaling application in that each set of linear measures constructed from different instruments and groups of respondents has its own origin (zero-point) and unit size, based on the logit, as these emerge from the given dataset. Logits are a probabilistic unit whose substantive size depends on its context [21]. In applying a model of measurement facilitating the construction of linear measures for each instrument and so quantifying a given measurand in a particular unit, we have advanced, but not far enough.

7.2 The Need for a Universal Measurement Scale for a Variable

“Un roi, une loi, un poids, et une mesure” (One king, one law, one weight, and one measure) was an early slogan of the French Revolution [12]. It summarized the situation that, although all the weights and measures were linear, they differed across locations and classes of people. The French peasants were victims of this system. Consequently, in 1799, the metric system was launched in France to unify weights and measures.

Disparities between measurement scales in the social sciences have a less obvious economic and social impact, but we can see their effect [8, 28]. Physical science, based on generalized measures, is advancing rapidly. Findings can be shared easily and productively. On the other hand, social science journals report endlessly on the construction of local measurement scales, but even when the opportunities present themselves, these scales are not combined together to aid the advancement of social science. Thurstone [38, p. 10] understood that mathematical modeling and measurement are not merely useful tools, but should be the very language in which one thinks. We then unnecessarily encumber our communications and ability to learn when we fail to link our measures together in common languages. But even when linear measures are obtained, current practice in psychology and the social sciences reports results from instruments measuring the same thing in different units. The following illustrations are intended to show that an ever-growing Tower of Babel need not be taken for granted. On the contrary, correspondences between different instruments measuring the same thing can be leveraged systematically so as to more explicitly link new results with old.

7.2.1 Our Objective: Combining Local Measurement Scales

Let us demonstrate methods for combining local measurement scales into universal measurement scales using the example of Phobia. According to Kessler et al. [13], 18.1% of adults in the U.S.A. suffer from anxiety disorders. Phobias are a common form of anxiety disorders, so let us construct linear measures of phobia intensity. We will do this using the dataset from Imaizumi and Tanno [11], but will ignore their analysis and findings. The dataset consists of the responses of 582 Japanese adults to 17 questions (symptoms) relating to their experience of Trypophobia. Only one symptom directly mentions this specific phobia, so here it is slightly rewritten to apply to any phobia. The responses were on a rating scale from 1 (“Not at all”) to 5 (“Extremely”). Our aim is to construct a measurement “ruler” that can be applied to any phobia by anyone. In so doing, we will show how satisfaction of the invariance, unidimensionality, and construct definition requirements of measurement set the stage not only for equating the instruments to a common metric, but for metrological traceability and quality assurance.

Table 7.1 shows the 17 phobia symptoms in the dataset, together with a one-word abbreviation of each symptom. In this chapter, three separate samples of clients are extracted from the dataset, along with their responses to three different, but overlapping, subsets of questions. The symptoms for each sample are identified in Table 7.1, where “X” or a code letter indicates that this symptom is selected for the sample.

Table 7.1 Seventeen phobia symptoms and their usage in the samples analyzed

Each sample and subset of questions mimics the collection of data on separate questionnaires with some equivalent questions. The data from each sample will be analyzed with different estimation methods implemented in different software. Based on the estimates (measures) for the three samples, a linear measurement scale of all 17 symptoms will be constructed which could be the basis for combining further phobia questionnaires or constructing new ones. Using this measurement scale, client measures based on any subset of the 17 symptoms can be expressed as measures on the scale of all 17 symptoms, demonstrating that the measures can be independent of the particular subset of questions answered.

This demonstration focuses on the technical aspects of constructing and combining measurement scales. In a full implementation, we would require content experts, such as psychologists, to verify that we are measuring what we intend to measure (Construct Validity) and that the measures make sense when applied to clients (Predictive Validity).

7.2.2 Constructing a Local Measurement Scale for a Latent Variable

Sample 1 is the first 200 clients with non-extreme scores in the Trypophobia dataset. Clients with extreme scores, whose responses are all in the top category of the rating scale or all in the bottom category, are uninformative for constructing a measurement scale because they do not differentiate between the severity of the symptoms.
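As a minimal sketch of this screening step (the list-of-ratings representation and all names are ours, not from the Trypophobia dataset), clients whose responses are all in the bottom category or all in the top category can be set aside as follows:

```python
# Hypothetical illustration of screening out extreme-score clients.

def has_extreme_score(ratings, min_cat=1, max_cat=5):
    """True when every rating is in the bottom category or every rating is
    in the top category: such a client does not differentiate between
    the severities of the symptoms."""
    return all(r == min_cat for r in ratings) or all(r == max_cat for r in ratings)

clients = [
    [1, 1, 1, 1],   # all bottom category: extreme, uninformative
    [5, 5, 5, 5],   # all top category: extreme, uninformative
    [1, 3, 5, 2],   # mixed ratings: informative
]
kept = [c for c in clients if not has_extreme_score(c)]  # only the mixed client remains
```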

We will construct linear measures for these 200 clients and the 10 symptoms indicated in Table 7.1. Client phobia intensities and phobia symptom severities are located on the same measurement scale (conjoint measurement) using the probabilistic Rasch Rating Scale Model [1] and Marginal Maximum Likelihood Estimation (MMLE) implemented in the TAM software [33]. Each symptom is modeled to have one “symptom severity” parameter, and each client is modeled to have one “phobia intensity” parameter. There is also a 5-category rating scale from 1, “Not at all”, to 5, “Extremely”. It is modeled with 4 parameters, the Andrich thresholds, with one for each point of equal probability of adjacent categories.
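As a sketch of the model just described, the Rating Scale Model (following Andrich [1]) specifies the log-odds of adjacent rating categories as:

```latex
% \beta_n: phobia intensity of client n
% \delta_i: severity of symptom i
% \tau_k:  k-th Andrich threshold (k = 1, ..., 4 for a 5-category scale)
\log \left( \frac{P_{nik}}{P_{ni(k-1)}} \right) = \beta_n - \delta_i - \tau_k
```

so that the client parameters, symptom parameters, and the four thresholds shared by all symptoms locate clients and symptoms together on one linear logit scale.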

Our working hypotheses are that the clients with the highest scores on the symptoms have the highest phobia intensity, and that the symptoms with the highest client scores are those most often experienced, and so are the least severe indicators of phobia. These hypotheses will need to be confirmed by phobia experts; the goal here is only to see if data from a professionally constructed measure of phobia exhibit the patterns of invariance necessary and sufficient for estimating quantity values independently of which symptoms are used to measure the clients and which clients are measured.

In MMLE, the client phobia intensity measures are modeled to have a normal distribution, locally centered on zero. Since client measures are never exactly normally distributed, this may slightly bias the measures, but this bias is inconsequential for most practical uses, as we will see. An essential aspect of the methodology presented here is that findings are always provisional. We never know the exact truth, but we do need our findings to be “good enough for government work,” which is to say, good enough to accomplish the task at hand to the required tolerance levels as efficiently as possible.

Table 7.2 shows the results of the MMLE estimation. The “Symptom Number” is the original question number in the Trypophobia instrument. The Symptom Measures estimated by TAM are in logits, reported with three decimal places. The zero point is the mean phobia intensity of this sample of clients so that roughly half the clients would be reported with negative phobia intensity.

Table 7.2 Rasch severities of symptoms for Sample 1

These logit measures, with negatives and decimals, can be difficult for a non-technical audience to understand and use. We also intend to generalize this scale to other symptoms, which may be more or less severe than any of these symptoms. New severity extremes would push the logit scale to more negative and/or more positive values. This might also shift the zero midpoint up or down the scale, rendering the numeric values even less intuitive than they originally were.

Accordingly, we will linearly rescale the measurement range for the current symptoms and clients so that all the client phobia measures can be reported as convenient positive integers on a linear scale from 200 to 800. For the symptoms, these rescaled symptom calibrations are in column 5 of Table 7.2. The conversion shown in Table 7.2 is Rescaled Calibration = 52.1 * Logit Calibration + 424.2. Figure 7.1 shows histograms of the clients and symptoms on the 200–800 Scale.
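The rescaling is a simple linear transformation. A minimal sketch using the slope and intercept reported in Table 7.2 (the function name is ours):

```python
# Linear conversion from Table 7.2: maps local logit calibrations
# onto the more convenient 200-800 phobia scale.

def rescale(logit, slope=52.1, intercept=424.2):
    """Map a logit calibration onto the 200-800 phobia scale."""
    return slope * logit + intercept

# A symptom at 0 logits (the mean client phobia intensity for this
# sample) lands at 424.2 on the rescaled metric.
```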

Fig. 7.1 Distributions of the 200 clients in Sample 1 and their 10 symptoms on the 200–800 phobia scale

We are delighted to see that many clients in this sample have only milder symptoms and few have more severe symptoms. A measure of 200 corresponds to a client who reports “Not at all” to all 10 symptoms. A measure of 800 corresponds to a client who reports “Extremely” to all 10 symptoms.

The rating scale has 5 categories from 1 (“Not at all”) to 5 (“Extremely”). Can our respondents discriminate 5 levels of intensity of their phobias? Moving from left to right in Fig. 7.2, the probability of observing each category first increases and then decreases for this sample of clients, as inferred by the Rasch model. The key feature in Fig. 7.2 is that each category in turn becomes the most probable to be observed. This confirms that the persons in Sample 1 are able to discriminate 5 levels of symptom intensity. Good!

Fig. 7.2 Rasch-model probability curves for the Sample 1 phobia rating scale. Andrich thresholds are at the points of equal probability of adjacent categories. They are −89, −24, 23, and 90, relative to the severity of the symptom. At the vertical line, the client phobia intensity equals the symptom phobia severity

In Fig. 7.2, these four Andrich transition thresholds, which are the locations at which adjacent categories are equally probable, have the rescaled values of −89, −24, 23, and 90 relative to the severity of the symptom. This means that, when a client’s measure matches a symptom’s calibration, so that the difference is 0, a vertical line at 0 in Fig. 7.2 shows that a response in category 3 would occur 40% of the time, responses in categories 2 or 4 would each occur about 25% of the time, and responses in categories 1 or 5 would each occur about 5% of the time. Similar patterns of probabilities summing to 100% can be read from Fig. 7.2 for clients with measures above or below any given symptom’s calibration.
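These percentages can be checked directly, assuming the Rating Scale Model. A sketch (our own code; the rescaled thresholds are converted back to logits using the Table 7.2 slope of 52.1 rescaled units per logit):

```python
import math

# Check of the probabilities read from Fig. 7.2 under the Rasch
# Rating Scale Model, for a client whose measure equals the symptom's
# calibration (difference of 0 logits).

SLOPE = 52.1
taus = [t / SLOPE for t in (-89, -24, 23, 90)]  # Andrich thresholds in logits

def category_probabilities(diff, taus):
    """Probabilities of categories 1..5 for a client-minus-symptom
    difference `diff` (in logits)."""
    # Rating Scale Model numerators: exp(k*diff - sum of first k thresholds),
    # for k = 0..4 (k indexes the category above the bottom one).
    nums = [math.exp(k * diff - sum(taus[:k])) for k in range(len(taus) + 1)]
    total = sum(nums)
    return [n / total for n in nums]

probs = category_probabilities(0.0, taus)
# probs[2] is approximately 0.40 (category 3), probs[1] and probs[3]
# approximately 0.25, probs[0] and probs[4] approximately 0.05.
```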

7.2.3 Equating Sample 2 of Clients

The data analyzed for Sample 1 were collected using a specific 5-category item format. Another instrument may use a different format, such as a different rating scale. Let’s simulate this using Sample 2, comprised of another 100 clients from the Trypophobia dataset, with 8 of the 17 symptoms (see Table 7.1), four of which were also included in Sample 1.

But instead of retaining the five rating scale categories, we will dichotomize them into two categories to simulate a fictional situation in which another instrument designer approached the phobia construct from their own point of view, with their own purposes. The original categories 1 and 2 are rescored 0, “mild”; original categories 3, 4, and 5 become 1, “severe”. These data will be analyzed in eight ways using a variety of estimation methods and software packages. We will also analyze Sample 2 with the original 5 categories for comparison.
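A minimal sketch of this recoding (names are ours):

```python
# Dichotomization described in the text: original categories 1-2 become
# 0 ("mild"); original categories 3-5 become 1 ("severe").

def dichotomize(rating):
    return 0 if rating <= 2 else 1

original = [1, 2, 3, 4, 5]
recoded = [dichotomize(r) for r in original]  # [0, 0, 1, 1, 1]
```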

The overall intent is to demonstrate how we can free our thinking from the dictates of results that vary across samples in their specific features but which are actually statistically identical. We focus attention on the patterns of invariance from shared structures capable of supporting instrument equating, metrological traceability, and quality-assured quantity values. Here are the nine estimation methods.

  1. Conditional Maximum Likelihood Estimation (CMLE) using “eRm” [27]. CMLE is regarded as the best estimation method because its item difficulty (symptom severity) estimates are statistically consistent [40]. However, CMLE is severely limited in the data designs that it can estimate. For instance, today’s datasets are often large and sparse. These are inestimable by CMLE, but can be estimated by other methods such as MMLE and JMLE. CMLE estimates, furthermore, are not symmetric. The distance on the measurement scale between a Symptom and a Client depends on whether Symptoms or Clients are conditioned out of the estimation equations. Here we first condition out the clients, then estimate the symptom calibrations in logits. These logits are reported in Table 7.3.

  2. CMLE using “eRm” is then applied to the transposed Sample 2 data matrix to estimate logit measures for the clients (columns) by conditioning out the symptoms (rows). We then anchor (fix the values of) the client measures and use them to estimate the symptom calibrations using Anchored Maximum Likelihood Estimation (AMLE), also called MLE, implemented in “eRm”.

  3. JMLE1: Joint Maximum Likelihood Estimation (JMLE) with “Winsteps” [20] is used to estimate both the Symptoms and the Clients simultaneously. JMLE estimates are symmetric, so the extra steps taken in the CMLE examples are not needed.

  4. JMLE2: JMLE with “TAM” again estimates both the Symptoms and the Clients simultaneously, using the same method but different software.

  5. MMLE1: MMLE with “ltm” [31] is used to estimate the symptom severities. MMLE assumes a normal distribution of clients.

  6. MMLE2: MMLE with “TAM” is used to estimate the symptom severities.

  7. PMLE1: Pairwise Maximum Likelihood Estimation (PMLE) with “sirt” [32] is used to estimate the symptom severities. PMLE is statistically consistent and allows great flexibility in data designs. However, its uneven use of observations in the data matrix can lead to situations in which estimates are biased.

  8. PMLE2: PMLE with “pairwise” [9] is used to estimate the symptom severities.

  9. JMLE5: JMLE with “Winsteps” is used to estimate symptom severities for Sample 2 data in its original 5-category format.

These nine sets of logit calibrations are shown in Table 7.3 and are plotted in Fig. 7.3. We can see immediately that the local logit scaling of each analysis produces findings that promisingly indicate convergence on a common construct and scale, but that are expressed in differing, not easily compared, numeric values. In Fig. 7.3, each symptom is one row located vertically by the symptom’s score, which is the same for all estimation methods except the 5-category JMLE5. For convenient comparability, the 5-category JMLE5 calibration has been placed on the row for the symptom’s dichotomous score.

Table 7.3 Comparison of logit symptom severity Calibrations for Sample 2
Fig. 7.3 Plot of symptom scores against logit symptom severity calibrations for Sample 2

In Fig. 7.3, there is a hierarchy of symptom severities held in common across the estimation methods. Symptoms with lower scores are more severe and so have higher logit calibrations, but the logit ranges for the symptoms estimated by the various methods overlap so that there is no precise picture of the hierarchy of symptoms. This is despite the fact that the “logit” itself has a precise probabilistic definition. That definition, however, results in logits having different substantive expressions depending on the characteristics of the data, the estimation method, and the implementation of that estimation method [21].

To overcome the deficiencies of local logit scaling, all 9 sets of calibrations in Table 7.3 are rescaled onto the 200–800 range defined by Sample 1. The empirical justification for this rescaling is shown in Fig. 7.4, which plots the rescaled calibration for each shared symptom from Table 7.2 against the logit calibrations of Table 7.3. Symptom 4, “Panic”, has calibrations that are discordant between Sample 1 and Sample 2, so it is omitted from the trend line calculation. The trend line for CMLE is shown in Fig. 7.4. Its slope and intercept enable us to transform the CMLE logit calibrations into rescaled values conforming to the Sample 1 scale. The conversion equation for CMLE is: CMLE rescaled calibration = 41.4 * logit calibration + 509.1.
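Coefficients of this kind can be obtained with an ordinary least-squares trend line through the shared symptoms’ (logit, rescaled) pairs, with the discordant symptom excluded. A sketch with invented illustrative values (these are not the actual Table 7.2 or Table 7.3 calibrations):

```python
# Ordinary least-squares trend line relating one scale to another,
# as used to derive conversion coefficients like those for CMLE.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical shared-symptom calibrations (Panic already excluded):
logits = [-1.2, -0.4, 0.3, 1.1]          # Sample 2 logit calibrations
rescaled = [460.0, 493.0, 522.0, 555.0]  # Sample 1 rescaled calibrations
slope, intercept = fit_line(logits, rescaled)
# rescaled calibration = slope * logit calibration + intercept
```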

Fig. 7.4 Plot of rescaled symptom calibrations for Sample 1 against logit symptom severity calibrations for Sample 2 for shared symptoms. The trend line for CMLE is shown. Symptom 4, “Panic”, is irregular and omitted from the trend line calculation

The conversion coefficients and the rescaled CMLE values are shown in Table 7.4. The same procedure is followed for the other sets of logit calibrations in Fig. 7.4, and the results are also shown in Table 7.4. The symptom score is adapted in a similar way and its adapted values are listed in Table 7.4. An alternative rescaling procedure, outlined in Humphry and Andrich [10], is to equate the means and variances of the shared symptom calibrations. The procedure followed here was chosen because it emphasizes the importance of symptom selection, i.e., quality control, in the rescaling process.

Table 7.4 Comparison of rescaled symptom severity calibrations for Sample 2

In Table 7.4, notice that although the JMLE1 and JMLE2 rescaled values are almost identical, as are also the MMLE1 and MMLE2 results, the conversion coefficients are different. Different implementations of even the same estimation method construct different local measurement scales. The “Average of 6” column in Table 7.4 is the average of the six rescaled calibrations to its left. These differ by 2 units or less (well within their approximately 10-unit uncertainties) from the Average of 6 value despite the differences in conversion coefficients.

The values in Table 7.4 are plotted in Fig. 7.5, but with the Average of 6 calibration replacing its 6 components which would otherwise overlay on the plot. On the y-axis, the “Average of 6” value in Table 7.4 has substituted for the Sample 1 calibration for symptoms in Sample 2 not included in Sample 1.

Fig. 7.5 The Sample 1 rescaled calibrations of the shared symptoms plotted against the Sample 2 rescaled calibrations. For the 3 symptoms with no Sample 1 calibration, the “Average of 6” Sample 2 calibration substitutes. The identity line is shown. Uncertainties on the x-axis are about 10 units

In Fig. 7.5, some patterns emerge. The adapted Score, symbol “*”, coincides with the Average of 6, symbol “⧫”, at the bottom of the plot, but diverges at the top of the plot (ringed). This reflects the departure of the score-to-measure ogive from a straight line. On the plot, calibrations, with an infinite range, are linear, so scores, with a bounded range, must be curvilinear. Reassuringly, JMLE5 diverges noticeably from the Average of 6 for only two symptoms (symbol “●” ringed). Dichotomizing the 5-category rating scale has not conspicuously changed the meaning of the calibrations.

One estimation method, PMLE, is slightly out of step with the other estimation methods, and its two implementations, symbols “■” and “▲”, somewhat disagree. The greatest divergence is at the top of the plot (ringed) where PMLE2 accords more closely with the Adapted Score than with the other Rasch calibrations. However, PMLE1 also outlies conspicuously near the bottom of the plot (ringed). An explanation could be that for most Rasch estimation methods, a symptom’s total score is the sufficient statistic for its symptom severity calibration. For PMLE, however, the sufficient statistics for the symptoms are the sums of counts of instances of clients who score in one category on one symptom and in another category on another symptom. A consequence is that, even for complete data where every client responds to every symptom, individual responses are used in the PMLE estimation process with different frequencies. This can bias the symptom estimates in an uneven way, meaning that PMLE is the least dependable of the methods discussed here.

Choppin [6] furthermore notes that misfit in the data can also bias PMLE calibrations, reminding us of the requirement to perform quality-control of the data, such as will be done with the mean-square statistics in Table 7.6. Other estimation methods and implementations are effectively equally dependable after rescaling, so convenience can guide the choice of method and implementation in practical situations.

There is a proviso. Some software implementations offer more than one option for the calculation of client intensity measures. The standard option, though not always available, is the maximum likelihood estimate. Another option is Warm’s mean likelihood estimate [41]. Further options include Bayesian adjustments [3, 39]. These need to be evaluated in a manner similar to Tables 7.3 and 7.4 to discover which client measures are generalizable with the implementation’s symptom conversion coefficients. This may also alter the choice of software implementation.

7.2.4 Equating Sample 3 of Clients

For Samples 1 and 2, the data are complete. We may even think that adapted raw scores are a good enough basis for a quasi-linear measurement system. However, raw scores require locally complete data. In contrast to this inefficient one-size-fits-all approach, tailoring the phobia questionnaire to the client’s situation may enable a shorter, but almost equally informative, questionnaire to be administered [4, 18, 26, 29].

We mimic this situation with Sample 3, a separate sample of 100 clients. For these clients we have a pool of 14 symptoms, indicated in Table 7.1 by “R”, “B”, “M”, “T”. Two questions, labeled “R”, are administered to all 100 clients. These are routing questions for a Flexilevel test [24]. If the client responds 1 or 2 on the 5-category rating scale to both of these symptoms, then subset B of the symptom items is administered. If the client responds 3, 4, or 5 to both symptoms, then subset T of symptoms is administered. Otherwise subset M is administered.

Subsets B, M and T are each 4 symptoms, so each client responds to 6 symptoms, selected according to the client’s phobia intensity. When this selection is applied to the 100 clients, 43 are administered subset B of symptoms, 31 have subset M, and 26 subset T. These data are analyzed by JMLE (using “Winsteps”). For comparison, the complete data for all 14 of these symptoms for the same 100 clients are also analyzed with JMLE (also using “Winsteps”).
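The routing rule just described can be sketched as follows (the function is our own; the subset labels follow the text):

```python
# Flexilevel routing: the two routing responses, each on the 1-5 scale,
# determine which follow-up subset of symptoms is administered.

def route(r1, r2):
    """Choose the follow-up symptom subset from the two routing responses."""
    if r1 <= 2 and r2 <= 2:
        return "B"   # both responses 1 or 2: bottom subset
    if r1 >= 3 and r2 >= 3:
        return "T"   # both responses 3, 4, or 5: top subset
    return "M"       # mixed responses: middle subset
```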

Table 7.5 shows the results. Its rows are sorted by the Flexilevel Logit Calibration. Scores are disordered, and so are no longer a possible surrogate for client measures, because missing data compromise comparability. The symptom calibrations in Table 7.5 are plotted in Fig. 7.6. Outlying symptoms 14 and 15 (ringed) are omitted from the conversion of the logits to the rescaled calibrations. The rescaled calibrations are plotted in Fig. 7.7. The symptom calibrations for the complete 14-symptom data approximate the Flexilevel symptom calibrations reasonably, and more closely than the Sample 1+2 calibrations, as should be expected given that the Flexilevel and 14-symptom calibrations are derived from the same sample’s data.

Table 7.5 Rasch severities of phobia symptoms for Sample 3
Fig. 7.6 Plot of rescaled symptom calibrations for Sample 1+2 against Sample 3 Flexilevel and 14-symptom logit severity calibrations. Trend lines for Flexilevel (arrowed) and 14-Symptoms are shown. Ringed estimates for symptoms 14 and 15 are far from the trend lines and omitted from their calculations. Uncertainty on the y-axis is about 5 units and on the x-axis is about .2 logits

Fig. 7.7 Plot of rescaled symptom calibrations for Sample 3. The rescaled calibrations for all 14 symptoms and for Sample 1+2 are plotted against the rescaled Flexilevel symptom calibrations. The ringed points are for symptoms 14 and 15. The diagonal line is the identity line. Uncertainty on both axes is about 5 units

The purpose behind the Flexilevel approach is to reduce the response burden on the clients without unduly reducing the measurement effectiveness of the instrument. Having established that the adaptively reduced data set defines the same phobia scale as the complete data set, we can ask whether the client measures are statistically identical across the two instrument administration methods.

Figure 7.8 plots the rescaled Flexilevel client intensity measures against their 14-symptom intensity measures. The reduction from 14 symptoms to 6 symptoms for each client shows a loss of some precision, but the trend is very strong, with most pairs of measures falling near the identity line.

Fig. 7.8 Scatter plot of 100 client calibrations for Sample 3. The scatter reflects the loss of precision when reducing from 14 symptoms to 6 symptoms for measuring each client. Uncertainty on the x-axis is 6 to 12 units and on the y-axis is 8 to 11 units

Figure 7.9 shows the effectiveness of the Flexilevel routing, with 82% of clients routed correctly. The left vertical arrows at about 430 mark the top of the low range of scale values targeted for the B (both initial items rated low) condition. The right vertical arrows at about 520 mark the top of the middle range of scale values targeted for the M condition (mixed low and high ratings on the first two items). The T condition (both initial items rated high) extends from 520 to the top of the scale.

Fig. 7.9 Plot of Flexilevel client calibrations for Sample 3 showing the impact of the two routing symptoms

Only 18% of the clients were routed to a less targeted set of symptoms, none by more than one level. That is, five clients in the B range have measures over 430, and five at the M level have measures lower than 430. Four more at the M level have measures higher than 520, above the intended range, and four at the T level have measures lower than 520, below the expected upper range. The Flexilevel process has accomplished its objective. Clients have not been burdened with the need to respond to questions irrelevant to their conditions, and this has been done without compromising the comparability of the resulting measurements.

7.2.5 Virtual Equating

In Fig. 7.4 the rescaling conversion was performed mathematically. We can also do it visually, a process called “Virtual Equating,” which is an application of substantive variable mapping [35, 36] and construct mapping [42] methods introduced by Wright and Stone [44] and Wright and Masters [43]. This approach may be attractive to those who think visually, or who desire to work with closer involvement in the qualitative features of the measured construct.

Figure 7.10 shows the 10 symptoms from Sample 1 at their rescaled values. Next to them are the 8 symptoms from Sample 2 positioned on their own logit phobia scale from the JMLE1 analysis. Five of the symptoms are the same, so we have used these to align the Sample 1 and Sample 2 phobia scales by eye. Keep in mind that the positions of these symptoms on the scale are estimated from the client responses and are not in any way a consequence of analyst inputs or researcher influence. As discussed above, merely numeric differences in the estimates obtained from separate samples may conceal the presence of statistically identical patterns of structural invariance. The boxes around four of the pairs of item names indicate the symptoms used for the virtual equating (Vomit, Itchiness, Anxious, Uneasy). The ringed symptom (Panic) is not used because it is too different (for unexplained reasons that may have to do with differences between the two samples of clients). This symptom is idiosyncratic in some way, so it is dropped from the virtual equating.

Fig. 7.10
The 10 symptoms for Sample 1 at their rescaled calibrations and the 8 symptoms for Sample 2 with their logit calibrations equated onto one scale

The end result of laying out the symptoms on the scales calibrated by the separate samples in Fig. 7.10 is confirmation of the conversion in Table 7.5. In Fig. 7.10, symptoms for Sample 2 that were not included in the Sample 1 scale have been aligned with the Sample 1 symptoms (arrowed). The combined phobia scale for Samples 1+2 is shown in the last column of Fig. 7.10.

7.2.6 Confirming the Phobia Scale

The 17 rescaled symptom estimates have now been placed onto a combined scale that locates in a shared frame of reference the ten symptoms from Table 7.2, the three symptoms from Table 7.4, and the four symptoms from Table 7.5. Let's call this combined scale the "Benchmark Phobia Scale". It comprises three subsets of symptoms which were analyzed with different estimation methods, implemented with different software, and which used different scoring mechanisms.

The intention behind the construction of the Benchmark Phobia Scale is that the scale surmounts the specific symptoms chosen, the specific estimation method or software used, and the specific format of the responses to the questions. This intention simply follows through, to its logical and practical consequences, the decision to employ an identified measurement model [34] requiring the persistent display of structural invariance across samples and instruments.

Let us check how well we have accomplished this. We have the datasets for Samples 1, 2, 3 containing different symptoms and samples of clients. Two datasets, Samples 1 and 3, have 5-category rating scales, but we have allowed each dataset to define the measurement characteristics of its own rating scale. Sample 2 has a 2-category, dichotomous, rating scale. We will now analyze all three datasets together using JMLE implemented in “Facets” [19]. The combined dataset is thus 17 symptoms and 200 + 100 + 100 = 400 clients. “Facets” allows the same symptom to be analyzed with different rating scales. In this analysis, the Rasch Grouped Rating Scale Model (GRSM, [25]) is used. Samples 1 and 3 are modeled to share the same 5-category rating scale. Sample 2 is modeled to use its own dichotomous rating scale.
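For readers who want to see the model behind such an analysis, here is a minimal sketch of the Rasch rating scale model in the Andrich form: the probability of each rating category depends on the client measure, the symptom severity, and a set of thresholds shared by a group of items. A grouped analysis (GRSM) simply gives each group of items its own threshold set. The threshold values in the example are illustrative, not estimates from these data.

```python
import math

# Hedged sketch of the Rasch rating scale model (Andrich form).
# P(category x) is proportional to exp(sum over k<=x of
# (theta - delta - tau_k)), where theta is the client measure,
# delta the symptom severity, and tau_k the Andrich thresholds.

def category_probs(theta, delta, taus):
    """Return P(category 0..m) for one client on one symptom."""
    # Cumulative log-numerators: psi_0 = 0, psi_x = psi_{x-1} + theta - delta - tau_x
    psis = [0.0]
    for tau in taus:
        psis.append(psis[-1] + theta - delta - tau)
    denom = sum(math.exp(p) for p in psis)
    return [math.exp(p) / denom for p in psis]

# A dichotomy is the special case with a single threshold at 0
p_wrong, p_right = category_probs(theta=1.0, delta=0.0, taus=[0.0])

# A 5-category scale has four thresholds (illustrative values)
probs5 = category_probs(theta=0.0, delta=0.0, taus=[-1.8, -0.6, 0.6, 1.8])
```

In a grouped analysis, Samples 1 and 3 would share one `taus` list while Sample 2 would use the single-threshold dichotomous case, exactly as the "Facets" specification above describes.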

Table 7.6 lists the Benchmark Phobia Scale calibrations, shown as the “Bench Scale” column, obtained by analysis of Samples 1, 2 and 3 separately and extracted from Tables 7.2, 7.4 and 7.5 in which all the calibrations had been rescaled to accord with Sample 1. The “Combined Samples” column contains the calibrations from the “Facets” analysis of all three Samples combined, rescaled using the method of Fig. 7.4. The 17 symptoms and three samples were extracted from the complete original Trypophobia dataset. This original dataset of 17 symptoms and 582 clients is analyzed with JMLE using “Winsteps”. Its calibrations, rescaled in the same way, are shown in the “Original Dataset” column. These are plotted in Fig. 7.11.

Table 7.6 Comparison of symptom severity calibrations for different datasets
Fig. 7.11
The Combined and Original calibrations plotted against the Benchmark Scale calibrations. The identity line is shown. The ringed outlier, at approximately (620, 570), is Symptom 9, "Urge". Uncertainty on the x-axis is 4 to 16 units and on the y-axis is 4 to 7 units

The very high correlations shown in Table 7.6 and the reasonable dispersions around the identity line in Fig. 7.11 (relative to their uncertainties of roughly 10 units) are reassuring. They indicate that a usefully definitive Benchmark Scale has already been achieved. The Original Dataset contains 8976 ratings for non-extreme clients. In the three Samples, there are 3400 ratings, i.e., 38% of the Original Dataset. Despite the deliberate manipulation and abbreviation of the original data, undertaken to demonstrate how the calibrated construct can persistently display its characteristic identity across variations in samples, instruments, and estimation, the original measurement system has been preserved in such a way that it can be generalized beyond these datasets.

In Table 7.6, the mean-square fit statistics for the Combined Samples analysis are all acceptable. Each mean-square is its chi-squared statistic divided by its degrees of freedom, so that its expectation is 1.0. Mean-squares in the range 0.5 – 1.5 are usually acceptable [22]. In the Original Dataset analysis, only one mean-square is alarming, 2.32 for Symptom 9, “Urge”. This indicates there is a major source of substantive inconsistency in the data for this symptom. Since the mean-square for this symptom in the Combined Samples analysis is reasonable, this statistic must be associated with responses in a subset of client ratings that was not part of the three samples.
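As a sketch of the fit statistic just described: an outfit mean-square is the average squared standardized residual, where each residual is the observed rating minus its model expectation, divided by the model standard deviation. The expected values and variances below are illustrative stand-ins for quantities that would come from a fitted Rasch model.

```python
# Hedged sketch of an (outfit) mean-square fit statistic for one
# symptom: the mean of squared standardized residuals across clients.
# Its expectation is 1.0; values in roughly 0.5 - 1.5 are acceptable.

def outfit_mean_square(observed, expected, variance):
    """Mean of (x - E)^2 / Var over all responses to a symptom."""
    z2 = [(x - e) ** 2 / v for x, e, v in zip(observed, expected, variance)]
    return sum(z2) / len(z2)

# Illustrative dichotomous example: for a success probability p,
# the expectation is p and the variance is p * (1 - p)
mnsq = outfit_mean_square([1, 0, 1], [0.7, 0.4, 0.9], [0.21, 0.24, 0.09])
```

A value far above 1.0, like the 2.32 for "Urge", means the responses are much more erratic than the model predicts; a value far below 1.0 means they are suspiciously predictable.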

Further, this symptom is the only conspicuous outlier in Fig. 7.11. In Table 7.6 and Fig. 7.12, the Benchmark severity of Symptom 9, Urge, is 609, but the Original severity is 571, which is 38 units (almost four uncertainty ranges) lower. In the anomalous subset of ratings, there is a greater propensity to rate "Urge" highly than in our data subsamples. In fact, investigation of the analytic details reveals that 17 of the 50 most unexpected ratings in the Original analysis are for the symptom "Urge". In every case, "Urge" has been rated much higher than expected. Though professionally informed opinions on specific client cases may lead to insights as to why these inconsistent observations occurred, these results suggest that "Urge" is not a stable symptom of phobia, and should be removed from the Benchmark Scale.

In summary, there are already indications that improvements can be made to the choice of symptoms and to the severity values on the benchmark scale. Quality control, such as this, is an essential and continuing feature of any measurement system.

7.3 Measuring with the Benchmark Scale

Suppose that we have a new instrument containing 5 of the symptoms and a newly defined rating scale. As has been demonstrated, calibrations can be obtained from any subset of the 17 symptoms of the Benchmark Scale and further symptoms can be added, temporarily or permanently. We want to put the new 5-symptom instrument on the Benchmark scale. Here are some procedures:

  1.

    Administer the new instrument to 20 or more clients, preferably 50 [15], then follow the procedure depicted in Tables 7.3 and 7.4. Analyze these data with your chosen software. Output logits and obtain the conversion coefficients to convert local logits onto the Benchmark scale.

If a client responds to all 5 questions, then ratings can be summed. Use the logit values for the symptoms and the Andrich thresholds of the rating scale. Then apply AMLE to each possible raw score. If the chosen software does not support this directly, then AMLE can be implemented in Excel [17]. Finally apply the conversion coefficients to obtain a score-to-measure table similar to Table 7.7. If a client responds to some or all of the 5 symptom questions, the client's measure can also be computed immediately by applying AMLE to only those symptoms and the resulting raw score. This can be implemented in a smartphone app or the like.
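The AMLE step described above can be sketched as follows: with the symptom logits and Andrich thresholds anchored, solve for the measure at which the model-expected raw score equals each observed raw score, e.g. by Newton-Raphson. All numeric values here are illustrative, not the calibrations of an actual 5-symptom instrument.

```python
import math

# Hedged sketch of AMLE (anchored maximum likelihood estimation):
# with symptom logits (deltas) and Andrich thresholds (taus) fixed,
# find the client measure theta at which the model-expected raw
# score equals the observed raw score. Ratings are coded 0..4, so
# 5 symptoms give raw scores 0..20; only non-extreme scores 1..19
# have finite measures. All values below are illustrative.

def expected_and_info(theta, deltas, taus):
    """Expected raw score and total information, summed over symptoms."""
    exp_total, info_total = 0.0, 0.0
    for delta in deltas:
        psis = [0.0]
        for tau in taus:
            psis.append(psis[-1] + theta - delta - tau)
        denom = sum(math.exp(p) for p in psis)
        probs = [math.exp(p) / denom for p in psis]
        e = sum(x * p for x, p in enumerate(probs))
        e2 = sum(x * x * p for x, p in enumerate(probs))
        exp_total += e
        info_total += e2 - e * e  # score variance on this symptom
    return exp_total, info_total

def amle(raw_score, deltas, taus, tol=1e-6):
    """Measure (in logits) for a non-extreme raw score (Newton-Raphson)."""
    theta = 0.0
    for _ in range(200):
        e, info = expected_and_info(theta, deltas, taus)
        step = (raw_score - e) / info
        theta += max(-1.0, min(1.0, step))  # damp large steps
        if abs(step) < tol:
            break
    return theta

# Illustrative anchored values for a 5-symptom, 5-category instrument
deltas = [-1.2, -0.5, 0.0, 0.6, 1.1]
taus = [-1.8, -0.6, 0.6, 1.8]
score_to_measure = {score: amle(score, deltas, taus) for score in range(1, 20)}
```

Applying the rescaling conversion coefficients to each logit measure in `score_to_measure` would then yield a table on the Benchmark scale, similar to Table 7.7. For a client with missing responses, the same function is applied with `deltas` restricted to the symptoms actually answered.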

Table 7.7 Client scores and Benchmark scale intensity calibrations for 5 phobia symptoms
Fig. 7.12
The symptom severities from Table 7.6 shown graphically

  2.

    If a score-to-measure table similar to Table 7.7 is needed immediately, before any clients have responded, then it can be constructed by choosing reasonable conversion coefficients from Table 7.4 or similar. Convert the benchmark-scale values for the chosen symptoms and rating scale from, say, Table 7.4 and Fig. 7.2 back to logits, then apply the AMLE procedure in 1 above.

7.3.1 A Phobia Intensity “Ruler”

Table 7.4 and Fig. 7.2 further enable us to construct a measurement ruler, known as a “Keyform” [14, 16], for phobia intensity. A keyform for the 8 symptoms scaled in Sample 2 is shown in Fig. 7.13. This corresponds to the dichotomized rating scale as analyzed by JMLE1 in Table 7.4. The symptoms are positioned at their severities along the Benchmark phobia scale. “O” indicates the client affirmed this symptom. “X” indicates the client did not affirm this symptom. The approximate client intensity measure is shown by the vertical line.

Fig. 7.13
Measurement "Keyform" for phobia based on 8 dichotomized symptoms, positioned at Uneasy 410, Bumps 442, Skin 470, Anxious 480, Itchiness 530, Panic 543, Vomit 580, and Crying 610. The vertical line shows the measurement for a client who affirmed the ringed ratings. Uncertainty in the client measures is about 35 units

Figure 7.14 shows the same data with the 5-category rating scale as analyzed by JMLE5 in Table 7.4. The x-axis is the Phobia Intensity Scale. The y-axis lists the symptoms. Along each row, the rating-scale categories for each symptom are positioned at their calibrations on the Phobia Intensity Scale. Categories 2, 3, and 4 for each symptom are positioned at their locations of maximum probability, as shown in Fig. 7.2. The maximum probabilities for categories 1, "Not at all", and 5, "Extreme", are at opposite infinities. This is impractical, so the numerals for categories 1 and 5 are positioned where the expected scores on each symptom are 1.25 and 4.75, respectively, on the 1–5 rating scale. These values are chosen so that their locations on the Phobia Scale are distant from the intensity measures for categories 2 and 4, but not unreasonably far away. The numerals 1 and 5 at the extremes of each row remind us that those ratings continue out to infinity.
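The placement of the extreme-category numerals just described can be sketched as a root-finding problem: locate the measure at which the model-expected rating equals 1.25 (or 4.75). A bisection sketch, with illustrative Andrich thresholds:

```python
import math

# Hedged sketch of locating keyform endpoints: find the measure theta
# at which the model-expected rating on a symptom (coded 1..5) equals
# a target such as 1.25 or 4.75. Threshold values are illustrative.

def expected_rating(theta, delta, taus):
    """Model-expected rating, coded 1..5, for one symptom."""
    psis = [0.0]
    for tau in taus:
        psis.append(psis[-1] + theta - delta - tau)
    denom = sum(math.exp(p) for p in psis)
    return sum((x + 1) * math.exp(p) / denom for x, p in enumerate(psis))

def theta_for_expected(target, delta, taus, lo=-10.0, hi=10.0):
    """Measure at which the expected rating equals `target` (bisection)."""
    # expected_rating is monotone increasing in theta, so bisection works
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_rating(mid, delta, taus) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

taus = [-1.8, -0.6, 0.6, 1.8]  # illustrative Andrich thresholds
low_end = theta_for_expected(1.25, delta=0.0, taus=taus)
high_end = theta_for_expected(4.75, delta=0.0, taus=taus)
```

On a keyform row, `low_end` and `high_end` (after rescaling to Benchmark units) are where the numerals 1 and 5 would be printed for that symptom.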

Fig. 7.14
Measurement "Keyform" for phobia based on 8 5-category symptoms. The x-axis shows the client phobia intensity calibrations. For each symptom, the expected intensity calibration for each category is shown by the category number. The vertical line shows the measurement for a client with the ringed ratings. Uncertainty in the client measures is about 25 units

Figure 7.14 can be used immediately as a measuring device. In this example, the rings around the client's ratings allow the therapist to gauge the client's measure by eye. A vertical line drawn through the middle of the ratings approximates a measure of the intensity of the client's phobia. Uncertainty values or a 95% confidence interval define the horizontal range outside of which ratings are unexpected outliers, such as the "2" for "Skin". These exceptions may be useful for further diagnosis of the client's condition.

Notice that the precision of the estimated measure for this client on the 5-category rating scale is higher than for the dichotomized rating scale in Fig. 7.13, so that smaller changes in phobia intensity can be tracked on the 5-category scale than on the dichotomous scale. The diagnostic power (quality control) of the 5-category scale is also higher.
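The precision comparison can be sketched numerically: the standard error of a client measure is one over the square root of the total statistical information, and a 5-category item contributes more information than its dichotomized counterpart. The severities and thresholds below are illustrative, not the Sample 2 calibrations.

```python
import math

# Hedged sketch of measurement precision: SEM = 1 / sqrt(total
# information), where each symptom contributes its model score
# variance. A 5-category symptom carries more information than a
# dichotomized one, so the SEM is smaller. Values are illustrative.

def item_info(theta, delta, taus):
    """Score variance (= Fisher information) for one symptom."""
    psis = [0.0]
    for tau in taus:
        psis.append(psis[-1] + theta - delta - tau)
    denom = sum(math.exp(p) for p in psis)
    probs = [math.exp(p) / denom for p in psis]
    e = sum(x * p for x, p in enumerate(probs))
    return sum(x * x * p for x, p in enumerate(probs)) - e * e

deltas = [-1.5, -0.8, -0.2, 0.3, 0.9, 1.1, 1.6, 2.0]  # 8 symptom severities

# Dichotomous scale: one threshold per symptom
sem_dich = 1.0 / math.sqrt(sum(item_info(0.0, d, [0.0]) for d in deltas))

# 5-category scale: four thresholds per symptom (illustrative)
taus5 = [-1.8, -0.6, 0.6, 1.8]
sem_poly = 1.0 / math.sqrt(sum(item_info(0.0, d, taus5) for d in deltas))
```

With these illustrative values, `sem_poly` comes out noticeably smaller than `sem_dich`, mirroring the approximately 25-unit versus 35-unit uncertainties quoted for Figs. 7.14 and 7.13.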

7.4 Discussion

In this chapter we have seen how social science variables can be constructed, blended and extended, and then refined into a usable measurement system. Using different segments of the Trypophobia dataset, we have demonstrated how the findings from different samples of clients and different subsets of symptoms can be combined to construct a comprehensive measurement scale. In the course of this, we discovered that the measurement scale can surmount the specific response structure employed to communicate with the clients. We also showed that the measurement scale accommodates different estimation methods and different software implementations of those estimation methods without distorting the meaning of the unit quantity.

There is a paradoxical finding in Table 7.4. Two estimation methods, CMLE and PMLE, are considered to be of higher quality according to statistical theory [2, 40] and in practical application [23]. They accordingly might be expected to agree with each other, and to disagree with all the rest of the estimation methods. However, after rescaling, CMLE agrees with all the other estimation methods; only PMLE does not agree, and it does not seem to provide superior estimates.

Following the procedures described here, another Phobia Questionnaire with its dataset of responses can be aligned with the Benchmark Scale, perhaps augmenting the Benchmark Scale. For a brand-new questionnaire, the graphical depiction of the Benchmark Scale in Fig. 7.14 can be used to obtain linear client measures immediately. Then, after 20 or more relevant clients have responded, a more precise alignment between the new questionnaire and the Benchmark scale can be made.

The linear Phobia measures are ideal for statistical analysis and for communication between therapists, researchers, and clients. Investigations into the causes of phobias and their treatment become independent of the specifics of the phobias and the devices used to collect data about them. One more unit on the Benchmark phobia scale means the same amount of change, regardless of the intensity of phobia experienced. If the results of different studies using different instruments measuring the same thing were aligned and expressed in a common language, social science might at last be positioned to facilitate the kind of rapid advances the physical sciences have demonstrated are possible when we are able to learn quickly and effectively from each other's efforts.