SOLUTION: Sources of test bias. Explain a source of test bias that can threaten validity of the test results. Include a peer-reviewed article that discusses test bias. What steps can be taken to reduce the risk of bias?



CHAPTER 6

Ability Testing: Group Tests and Controversies

The practical success of early intelligence scales such as the 1905 Binet-Simon test motivated psychologists and educators to develop instruments that could be administered simultaneously to large numbers of examinees. Test developers were quick to realize that group tests allowed for the efficient evaluation of dozens or hundreds of examinees at the same time. As reviewed in an earlier chapter, one of the first uses of group tests was for screening and assignment of military personnel during World War I. The need to quickly test thousands of Army recruits inspired psychologists in the United States, led by Robert M. Yerkes, to make rapid advances in psychometrics and test development (Yerkes, 1921). Many new applications followed immediately in education, industry, and other fields. In Topic 6A, Group Tests of Ability and Related Concepts, we introduce the reader to the varied applications of group tests and also review a sampling of typical instruments. In addition, we explore a key question raised by the consequential nature of these tests: can examinees boost their scores significantly by taking targeted test preparation courses? This is but one of many unexpected issues raised by the widespread use of group tests. In Topic 6B, Test Bias and Other Controversies, we continue a reflective theme by looking into test bias and other contentious issues in testing.

NATURE, PROMISE, AND PITFALLS OF GROUP TESTS

Group tests serve many purposes, but the vast majority can be assigned to one of three types: ability, aptitude, or achievement tests. In the real world, the distinction among these kinds of tests often is quite fuzzy (Gregory, 1994a). These instruments differ mainly in their functions and

TOPIC 6A Group Tests of Ability and Related Concepts

Nature, Promise, and Pitfalls of Group Tests

Group Tests of Ability

Multiple Aptitude Test Batteries

Predicting College Performance

Postgraduate Selection Tests

Educational Achievement Tests


a hundredth of the time needed to administer the same test individually. Again, in certain comparative studies, e.g., of the effects of a week’s vacation upon the mental efficiency of school children, it becomes imperative that all S’s should take the tests at the same time. On the other hand, there are almost sure to be some S’s in every group that, for one reason or another, fail to follow instructions or to execute the test to the best of their ability. The individual method allows E to detect these cases, and in general, by the exercise of personal supervision, to gain, as noted above, valuable information concerning S’s attitude toward the test.

In sum, group testing poses two interrelated risks: (1) some examinees will score far below their true ability, owing to motivational problems or difficulty following directions, and (2) invalid scores will not be recognized as such, with undesirable consequences for these atypical examinees. There is really no simple way to entirely avoid these risks, which are part of the trade-off for the efficiency of group testing. However, it is possible to minimize the potentially negative consequences if examiners scrutinize very low scores with skepticism and recommend individual testing for these cases.

We turn now to an analysis of group tests in a variety of settings, including cognitive tests for schools and clinics, placement tests for career and military evaluation, and aptitude tests for college and postgraduate selection.

GROUP TESTS OF ABILITY

Multidimensional Aptitude Battery-II (MAB-II)

The Multidimensional Aptitude Battery-II (MAB-II; Jackson, 1998) is a recent group intelligence test designed to be a paper-and-pencil equivalent of the WAIS-R. As the reader will recall, the WAIS-R is a highly respected instrument (now replaced by the WAIS-III), in its time the most widely used of the available adult intelligence tests. Kaufman (1983) noted that the WAIS-R was “the criterion of adult intelligence, and no other instrument even comes

applications, less so in actual test content. In brief, ability tests typically sample a broad assortment of proficiencies in order to estimate current intellectual level. This information might be used for screening or placement purposes, for example, to determine the need for individual testing or to establish eligibility for a gifted and talented program. In contrast, aptitude tests usually measure a few homogeneous segments of ability and are designed to predict future performance. Predictive validity is foundational to aptitude tests, and often they are used for institutional selection purposes. Finally, achievement tests assess current skill attainment in relation to the goals of school and training programs. They are designed to mirror educational objectives in reading, writing, math, and other subject areas. Although often used to identify educational attainment of students, they also function to evaluate the adequacy of school educational programs.

Whatever their application, group tests differ from individual tests in five ways:

• Multiple-choice versus open-ended format
• Objective machine scoring versus examiner scoring
• Group versus individualized administration
• Applications in screening versus remedial planning
• Huge versus merely large standardization samples

These differences allow for great speed and cost efficiency in group testing, but a price is paid for these advantages.

Although the early psychometric pioneers embraced group testing wholeheartedly, they recognized fully the nature of their Faustian bargain: Psychologists had traded the soul of the individual examinee in return for the benefits of mass testing. Whipple (1910) summed up the advantages of group testing but also pointed to the potential perils:

Most mental tests may be administered either to individuals or to groups. Both methods have advantages and disadvantages. The group method has, of course, the particular merit of economy of time; a class of 50 or 100 children may take a test in less than a fiftieth or


Spatial subtest on the MAB-II. In the Spatial subtest, examinees must mentally perform spatial rotations of figures and select one of five possible rotations presented as their answer (Figure 6.1). Only mental rotations are involved (although “flipped-over” versions of the original stimulus are included as distractor items). The advanced items are very complex and demanding.

The items within each of the 10 MAB-II subtests are arranged in order of increasing difficulty, beginning with questions and problems that most adolescents and adults find quite simple and proceeding upward to items that are so difficult that very few persons get them correct. There is no penalty for guessing and examinees are encouraged to respond to every item within the time limit. Unlike the WAIS-R in which the verbal subtests are untimed power measures, every MAB-II subtest incorporates elements of both power and speed: Examinees are allowed only seven minutes to work on each subtest. Including instructions, the Verbal and Performance portions of the MAB-II each take about 50 minutes to administer.

The MAB-II is a relatively minor revision of the MAB, and the technical features of the two versions are nearly identical. A great deal of psychometric information is available for the original version, which we report here. With regard to reliability, the results are generally quite impressive. For example, in one study of over 500 adolescents ranging in age from 16 to 20, the internal consistency reliability of Verbal, Performance, and Full Scale IQs was in the high .90s. Test–retest data for this instrument also excel. In a study of 52 young psychiatric patients, the individual subtests showed reliabilities that ranged from .83 to .97 (median of .90) for the Verbal scale and from .87 to .94 (median of .91) for the Performance scale (Jackson, 1984). These results compare quite favorably with the psychometric standards reported for the WAIS-R.

Factor analyses of the MAB-II are broadly supportive of the construct validity of this instrument and its predecessor (Lee, Wallbrown, & Blaha, 1990). Most recently, Gignac (2006) examined the factor structure of the MAB-II using a series of confirmatory factor analyses with data on 3,121 individuals reported in Jackson (1998). The best fit to the

close.” However, a highly trained professional needs about 1½ hours just to administer the Wechsler adult test to a single person. Because professional time is at a premium, a complete Wechsler intelligence assessment, including administration, scoring, and report writing, easily can cost hundreds of dollars. Many examiners have long suspected that an appropriate group test, with the attendant advantages of objective scoring and computerized narrative report, could provide an equally valid and much less expensive alternative to individual testing for most persons.

The MAB-II was designed to produce subtests and factors parallel to the WAIS-R but employing a multiple-choice format capable of being computer scored. The apparent goal in designing this test was to produce an instrument that could be administered to dozens or hundreds of persons by one examiner (and perhaps a few proctors) with minimal training. In addition, the MAB-II was designed to yield IQ scores with psychometric properties similar to those found on the WAIS-R. Appropriate for examinees from ages 16 to 74, the MAB-II yields 10 subtest scores, as well as Verbal, Performance, and Full Scale IQs.

Although it consists of original test items, the MAB-II is mainly a sophisticated subtest-by-subtest clone of the WAIS-R. The 10 subtests are listed as follows:

Verbal            Performance
Information       Digit Symbol
Comprehension     Picture Completion
Arithmetic        Spatial
Similarities      Picture Arrangement
Vocabulary        Object Assembly

The reader will notice that Digit Span from the WAIS-R is not included on the MAB-II. The reason for this omission is largely practical: There would be no simple way to present a Digit-Span-like subtest in paper-and-pencil format. In any case, the omission is not serious. Digit Span has the lowest correlation with overall WAIS-R IQ, and it is widely recognized that this subtest makes a minimal contribution to the measurement of general intelligence.

The only significant deviation from the WAIS-R is the replacement of Block Design with a


Intelligence factor independent of its contribution to the general factor.

Other researchers have noted the strong congruence between factor analyses of the WAIS-R (with Digit Span removed) and the MAB. Typically,

data was provided by a nested model consisting of a first-order general factor, a first-order Verbal Intelligence factor, and a first-order Performance Intelligence factor. The one caveat of this study was that Arithmetic did not load specifically on the Verbal

FIGURE 6.1 Demonstration Items from Three Performance Tests of the Multidimensional Aptitude Battery-II (MAB)

Source: Reprinted with permission from Jackson, D. N. (1984a). Manual for the Multidimensional Aptitude Battery. Port Huron, MI: Sigma Assessment Systems, Inc. (800) 265–1285.


The MAB-II shows great promise in research, career counseling, and personnel selection. In addition, this test could function as a screening instrument in clinical settings, as long as the examiner views low scores as a basis for follow-up testing with an individual intelligence test. Examiners must keep in mind that the MAB-II is a group test and, therefore, carries with it the potential for misuse in individual cases. The MAB-II should not be used in isolation for diagnostic decisions or for placement into programs such as classes for intellectually gifted persons.

A Multilevel Battery: The Cognitive Abilities Test (CogAT)

One important function of psychological testing is to assess students’ abilities that are prerequisite to traditional classroom-based learning. In designing tests for this purpose, the psychometrician must contend with the obvious and nettlesome problem that school-aged children differ hugely in their intellectual abilities. For example, a test appropriate for a sixth grader will be much too easy for a tenth grader, yet impossibly difficult for a third grader.

The answer to this dilemma is a multilevel battery, a series of overlapping tests. In a multilevel battery, each group test is designed for a specific age or grade level, but adjacent tests possess some common content. Because of the overlapping content with adjacent age or grade levels, each test possesses a suitably low floor and high ceiling for proper assessment of students at both extremes of ability. Virtually every school system in the United States uses at least one nationally normed multilevel battery.

The Cognitive Abilities Test (CogAT) is one of the best school-based test batteries in current use (Lohman & Hagen, 2001). A recent revision of the test is the CogAT Multilevel Edition, Form 6, released in 2001. Norms for 2005 also are available. We discuss this instrument in some detail.

The CogAT evolved from the Lorge-Thorndike Intelligence Tests, one of the first group tests of

separate Verbal and Performance factors emerge for both tests (Wallbrown, Carmin, & Barnett, 1988). In a large sample of inmates, Ahrens, Evans, and Barnett (1990) observed validity-confirming changes in MAB scores in relation to education level. In general, with the possible exception that Arithmetic does not contribute reliably to the Verbal factor, there is good justification for the use of separate Verbal and Performance scales on this test.

In general, the validity of this test rests upon its very strong physical and empirical resemblance to its parent test, the WAIS-R. Correlational data between MAB and WAIS-R scores are crucial in this regard. For 145 persons administered the MAB and WAIS-R in counterbalanced fashion, correlations between subtests ranged from .44 (Spatial/Block Design) to .89 (Arithmetic and Vocabulary), with a median of .78. WAIS-R and MAB IQ correlations were very healthy, namely, .92 for Verbal IQ, .79 for Performance IQ, and .91 for Full Scale IQ (Jackson, 1984a). With only a few exceptions, correlations between MAB and WAIS-R scores exceed those between the WAIS and the WAIS-R. Carless (2000) reported a similar, strong overlap between MAB scores and WAIS-R scores in a study of 85 adults for the Verbal, Performance, and Full Scale IQ scores. However, she found that 4 of the 10 MAB subtests did not correlate with the WAIS-R subscales they were designed to represent, suggesting caution in using this instrument to obtain detailed information about specific abilities.

Chappelle et al. (2010) obtained MAB-II scores for military personnel in an elite training program for AC-130 gunship operators. The officers who passed training (N = 59) and those who failed training (N = 20) scored above average (mean Full Scale IQs of 112.5 and 113.6, respectively), but there were no significant differences between the two groups on any of the test indices. This is a curious result insofar as IQ typically demonstrates at least mild predictive potential for real world vocational outcomes. Further research on the MAB-II as a predictor of real world results would be desirable.


Quantitative Battery appraise quantitative skills important for mathematics and other disciplines. The Nonverbal Battery can be used to estimate cognitive level of students with limited reading skill, poor English proficiency, or inadequate educational exposure.

For each CogAT subtest, items are ordered by difficulty level in a single test booklet. However, entry and exit points differ for each of eight overlapping levels (A through H). In this manner, grade-appropriate items are provided for all examinees.
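The idea of overlapping entry and exit points can be pictured with a brief sketch. The window size (25 items) and the step between levels (5 items) are hypothetical values chosen only for illustration; the actual CogAT item ranges are not reported here.

```python
def level_windows(levels="ABCDEFGH", window=25, step=5):
    """Return {level: (first_item, last_item)} for overlapping levels.

    All numbers are hypothetical; they are not the real CogAT item ranges.
    Items are assumed to be numbered in order of increasing difficulty.
    """
    windows = {}
    for index, level in enumerate(levels):
        first = 1 + index * step           # entry point moves up with each level
        windows[level] = (first, first + window - 1)  # exit point moves up too
    return windows


def shared_items(windows, level_a, level_b):
    """Count the items that two levels have in common."""
    first = max(windows[level_a][0], windows[level_b][0])
    last = min(windows[level_a][1], windows[level_b][1])
    return max(0, last - first + 1)


windows = level_windows()
for level, (first, last) in windows.items():
    print(f"Level {level}: items {first}-{last}")
print("Items shared by A and B:", shared_items(windows, "A", "B"))
```

Under these made-up values, adjacent levels share most of their items, whereas the lowest and highest levels share none. This is how a single booklet can offer a suitable floor and ceiling at every grade.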

The subtests are strictly timed, with limits that vary from 8 to 12 minutes. Each of the three batteries can be administered in less than an hour. However, the manual recommends three successive testing days for younger children. For older children, two batteries should be administered the first day, with a single testing period the next.

intelligence intended for widespread use within school systems. The CogAT is primarily a measure of scholastic ability but also incorporates a nonverbal reasoning battery with items that bear no direct relation to formal school instruction. The two primary batteries, suitable for students in kindergarten through third grade, are briefly discussed at the end of this section. Here we review the multilevel edition intended for students in 3rd through 12th grade.

The nine subtests of the multilevel CogAT are grouped into three areas: Verbal, Quantitative, and Nonverbal, each including three subtests. Representative items for the subtests of the CogAT are depicted in Figure 6.2. The tests on the Verbal Battery evaluate verbal skills and reasoning strategies (inductive and deductive) needed for effective reading and writing. The tests on the

Verbal Battery

1. Verbal Classification

Circle the item below that belongs with these three:

milk butter cheese

A. eggs B. yogurt C. grocery

D. bacon E. recipe

2. Sentence Completion

Circle the word below that best completes this sentence:

Fish _____________ in the ocean.

A. sit B. next C. fly

D. swim E. climb

3. Verbal Analogies

Circle the word that best fits this analogy:

Right → Left : Top →

A. Side B. Out C. Wrong

D. On E. Bottom

Quantitative Battery

4. Quantitative Relations

Circle the choice that depicts the relationship between

I and II:

I. 6/2 + 1

II. 9/3 − 1

A. I is greater than II B. I is equal to II

C. I is less than II

5. Number Series

Circle the number below that comes next in this series:

1 11 6 16 11 21 16

A. 31 B. 16 C. 26 D. 6 E. 11

6. Equation Building

Circle the choice below that could be derived from these:

1 2 4 + −

A. −1 B. 7 C. 0 D. 1 E. −3

FIGURE 6.2 Subtests and Representative Items of the Cognitive Abilities Test, Form 6


Nonverbal Battery

7. Figure Classification

Circle the item below that belongs with these three figures:

A B C D E

8. Figure Analogies

Circle the figure below that best fits with this analogy:

A B C D E

9. Figure Analysis

Circle the choice below that fits this paper folding and hole punching:

A B C D E

Note: These items resemble those on the CogAT 6. Correct answers: 1: B. yogurt (the only dairy product). 2: D. swim (fish swim in the ocean). 3: E. bottom (the opposite of top). 4: A. I is greater than II (4 is greater than 2). 5: C. 26 (the algorithm is add 10, subtract 5, add 10 . . .). 6: A. −1 (the only answer that fits). 7: A (four-sided shape that is filled in). 8: D (same shape, bigger to smaller). 9: E (correct answer).

FIGURE 6.2 continued


Ansorge (1985) has questioned whether all three batteries are really necessary. He points out that correlations among the Verbal, Quantitative, and Nonverbal batteries are substantial. The median values across all grades are as follows:

Verbal and Quantitative      .78
Nonverbal and Quantitative   .78
Verbal and Nonverbal         .72

Since the Quantitative battery offers little uniqueness, from a purely psychometric point of view there is no justification for including it. Nonetheless, the test authors recommend use of all batteries in hopes that differences in performance will assist teachers in remedial planning. However, the test authors do not make a strong case for doing this.
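One way to read these coefficients is to square them, which gives the proportion of variance that two batteries share. The short sketch below does only that arithmetic on the median correlations among the three batteries; it is an illustration, not an analysis reported by Ansorge (1985).

```python
# Median correlations among the CogAT batteries, as quoted in the text.
median_r = {
    ("Verbal", "Quantitative"): 0.78,
    ("Nonverbal", "Quantitative"): 0.78,
    ("Verbal", "Nonverbal"): 0.72,
}


def shared_variance(r):
    """Proportion of variance two measures have in common (r squared)."""
    return r ** 2


for (first, second), r in median_r.items():
    print(f"{first} and {second}: r = {r:.2f}, shared variance = {shared_variance(r):.2f}")
```

By this reckoning the Quantitative battery shares about 61 percent of its variance with each of the other two batteries, which is consistent with the claim that it adds little unique information.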

A study by Stone (1994) provides a notable justification for using the CogAT as a basis for student evaluation. He found that CogAT scores for 403 third graders provided an unbiased prediction of student achievement that was more accurate than teacher ratings. In particular, teacher ratings showed bias against Caucasian and Asian American students by underpredicting their achievement scores.

Raven’s Progressive Matrices (RPM)

First introduced in 1938, Raven’s Progressive Matrices (RPM) is a nonverbal test of inductive reasoning based on figural stimuli (Raven, Court, & Raven, 1986, 1992). This test has been very popular in basic research and is also used in some institutional settings for purposes of intellectual screening.

RPM was originally designed as a measure of Spearman’s g factor (Raven, 1938). For this reason, Raven chose a special format for the test that presumably required the exercise of g. The reader is reminded that Spearman defined g as the “eduction of correlates.” The term eduction refers to the process of figuring out relationships based on the perceived fundamental similarities between stimuli. In particular, to correctly answer items on the RPM, examinees must identify a recurring pattern or relationship between figural stimuli organized in a 3 × 3 matrix. The items are arranged in order of increasing difficulty, hence the reference to progressive matrices.

Raw scores for each battery can be transformed into an age-based normalized standard score with mean of 100 and standard deviation of 15. In addition, percentile ranks and stanines for age groups and grade level are also available. Interpolation was used to determine fall, winter, and spring grade-level norms.
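The relationship among percentile ranks, normalized standard scores, and stanines can be sketched as follows. This is a generic illustration that assumes the usual textbook definitions: a normalized score is obtained by passing the percentile rank through the inverse normal curve, and stanines use the conventional cumulative cut points of 4, 11, 23, 40, 60, 77, 89, and 96 percent. It does not reproduce the publisher's actual norming procedure, which is not described here.

```python
from statistics import NormalDist


def normalized_standard_score(percentile_rank, mean=100.0, sd=15.0):
    """Convert a percentile rank (between 0 and 100, exclusive) to a
    normalized standard score on a scale with the given mean and SD."""
    z = NormalDist().inv_cdf(percentile_rank / 100.0)  # z score under the normal curve
    return mean + sd * z


def stanine(percentile_rank):
    """Convert a percentile rank to a stanine (1 through 9) using the
    conventional cumulative cut points (an assumption of this sketch)."""
    cuts = [4, 11, 23, 40, 60, 77, 89, 96]
    return 1 + sum(percentile_rank >= cut for cut in cuts)


for pr in (2, 16, 50, 84, 98):
    score = normalized_standard_score(pr)
    print(f"percentile {pr:>2}: standard score {score:6.1f}, stanine {stanine(pr)}")
```

An examinee at the 50th percentile for his or her age group receives a standard score of 100 and a stanine of 5 under these definitions.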

The CogAT was co-normed (standardized concurrently) with two achievement tests, the Iowa Tests of Basic Skills and the Iowa Tests of Educational Development. Concurrent standardization with achievement measures is a common and desirable practice in the norming of multilevel intelligence tests. The particular virtue of joint norming is that the expected correspondence between intelligence and achievement scores is determined with great precision. As a consequence, examiners can more accurately identify underachieving students in need of remediation or further assessment for potential learning disability.
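A minimal sketch of how an expected achievement score might be derived from an ability score appears below. It uses the standard regression approach for two measures on the same scale. The correlation of .75 and the flagging threshold of 1.5 standard errors of estimate are hypothetical values chosen for illustration, not figures taken from the CogAT manual.

```python
import math

MEAN = 100.0  # both scores assumed to be on a mean-100, SD-15 scale
SD = 15.0


def predicted_achievement(ability, r):
    """Regression estimate of achievement given ability and their correlation r."""
    return MEAN + r * (ability - MEAN)


def standard_error_of_estimate(r):
    """Typical spread of observed achievement around the predicted value."""
    return SD * math.sqrt(1.0 - r ** 2)


def is_underachieving(ability, achievement, r, threshold=1.5):
    """Flag a student whose achievement falls more than `threshold`
    standard errors of estimate below prediction (hypothetical rule)."""
    discrepancy = achievement - predicted_achievement(ability, r)
    return discrepancy < -threshold * standard_error_of_estimate(r)


r = 0.75  # hypothetical ability-achievement correlation
print(predicted_achievement(120, r))
print(is_underachieving(ability=120, achievement=95, r=r))
print(is_underachieving(ability=120, achievement=110, r=r))
```

The more precisely the correlation is known, which is the advantage of co-norming, the more trustworthy the predicted score and the resulting discrepancy become.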

The reliability of the CogAT is exceptionally good. In previous editions, the Kuder-Richardson-20 reliability estimates for the multilevel batteries averaged .94 (Verbal), .92 (Quantitative), and .93 (Nonverbal) across all grade levels. The six-month test–retest reliabilities for alternate forms ranged from .85 to .93 (Verbal), .78 to .88 (Quantitative), and .81 to .89 (Nonverbal).
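For readers unfamiliar with the Kuder-Richardson-20 coefficient, the sketch below computes it for a small matrix of right/wrong item scores using the standard formula, KR-20 = (k / (k − 1)) × (1 − Σpq / variance of total scores). The response data are invented for illustration, and the population variance is used for the total scores, which is one common convention.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson-20 for a list of examinee rows of 0/1 item scores."""
    n_people = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    total_variance = pvariance(totals)  # variance of total scores
    pq_sum = 0.0
    for item in range(n_items):
        p = sum(row[item] for row in responses) / n_people  # proportion passing
        pq_sum += p * (1.0 - p)                             # item variance p*q
    return (n_items / (n_items - 1)) * (1.0 - pq_sum / total_variance)


# Invented data: five examinees, four items ordered from easy to hard.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(responses), 2))
```

Real batteries contain far more items and examinees than this toy matrix, so the sketch shows only how the formula is assembled.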

The manual provides a wealth of information on content, criterion-related, and construct validity of the CogAT; we summarize only the most pertinent points here. Correlations between the CogAT and achievement batteries are substantial. For example, the CogAT verbal battery correlates in the .70s to .80s with achievement subtests from the Iowa Tests of Basic Skills.

The CogAT batteries predict school grades reasonably well. Correlations range from the .30s to the .60s, depending on grade level, sex, and ethnic group. There does not appear to be a clear trend as to which battery is best at predicting grade point average. Correlations between the CogAT and individual intelligence tests are also substantial, typically ranging from .65 to .75. These findings speak well for the construct validity of the CogAT insofar as the Stanford-Binet is widely recognized as an excellent measure of individual intelligence.


reliability coefficients of .80 to .93 are typical. However, for preteen children, reliability coefficients as low as .71 are reported. Thus, for younger subjects, RPM may not possess sufficient reliability to warrant its use for individual decision making.
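The practical difference between a reliability of .71 and one of .93 can be illustrated with the standard error of measurement, SEM = SD × sqrt(1 − r). The sketch below assumes scores expressed on a scale with a standard deviation of 15, which is an illustrative choice and not a property of the RPM stated here.

```python
import math


def standard_error_of_measurement(reliability, sd=15.0):
    """SEM for a test with the given reliability on a scale with the given SD."""
    return sd * math.sqrt(1.0 - reliability)


def band_half_width(reliability, sd=15.0, z=1.96):
    """Half-width of an approximate 95 percent band around an obtained score."""
    return z * standard_error_of_measurement(reliability, sd)


for r in (0.71, 0.80, 0.93):
    sem = standard_error_of_measurement(r)
    print(f"r = {r:.2f}: SEM = {sem:.1f}, 95% band = +/- {band_half_width(r):.1f}")
```

Under these assumptions the SEM is roughly 8 points at a reliability of .71 but only about 4 points at .93, which helps explain why the lower value is considered inadequate for decisions about an individual child.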

Factor-analytic studies of the RPM provide little, if any, support for the original intention of the test to measure a unitary construct (Spearman’s g factor). Studies of the Coloured Progressive Matrices reveal three orthogonal factors (e.g., Carlson & Jensen, 1980). Factor I consists largely of very difficult items and might be termed closure and abstract reasoning by analogy. Factor II is labeled pattern completion through identity and closure. Factor III consists of the easiest items and is defined as simple pattern completion (Carlson & Jensen, 1980). In sum, the very easy and the very hard items on the Coloured Progressive Matrices appear to tap different intellectual processes.

The Advanced Progressive Matrices breaks down into two factors that may have separate predictive validities (Dillon, Pohlmann, & Lohman, 1981). The first factor is composed of items in which the solution is obtained by adding or subtracting patterns (Figure 6.3a). Individuals performing well on these items may excel in rapid decision making and in situations where part–whole relationships must be perceived. The second factor is composed of items in which the solution is based on the ability to perceive the progression of a pattern (Figure 6.3b). Persons who perform well on these items may possess good mechanical ability as well as good skills for estimating projected movement and performing mental rotations. However, the skills represented by each factor are conjectural at this point and in need of independent confirmation.

A huge body of published research bears on the validity of the RPM. The early data are well summarized by Burke (1958), while later findings are compiled in the current RPM manuals (Raven & Summers, 1986; Raven, Court, & Raven, 1983, 1986, 1992). In general, validity coefficients with achievement tests range from the .30s to the .60s. As might be expected, these values are somewhat lower than found with more traditional (verbally loaded) intelligence tests. Validity coefficients with other intelligence tests range from the .50s to the .80s.

Raven’s test is actually a series of three different instruments. Much of the confusion about validity, factorial structure, and the like stems from the unexamined assumption that all three forms should produce equivalent findings. The reader is encouraged to abandon this unwarranted hypothesis. Even though the three forms of the RPM resemble one another, there may be subtle differences in the problem-solving strategies required by each.

The Coloured Progressive Matrices is a 36-item test designed for children from 5 to 11 years of age. Raven incorporated colors into this version of the test to help hold the attention of the young children. The Standard Progressive Matrices is normed for examinees from 6 years and up, although most of the items are so difficult that the test is best suited for adults. This test consists of 60 items grouped into 5 sets of 12 progressions. The Advanced Progressive Matrices is similar to the Standard version but has a higher ceiling. The Advanced version consists of 12 problems in Set I and 36 problems in Set II. This form is especially suitable for persons of superior intellect.

Large sample U.S. norms for the Coloured and Standard Progressive Matrices are reported in Raven and Summers (1986). Separate norms for Mexican American and African American children are included. Although there was no attempt to use a stratified random-sampling procedure, the selection of school districts was so widely varied that the American norms for children appear to be reasonably sound. Sattler (1988) summarizes the relevant norms for all versions of the RPM. Raven, Court, and Raven (1992) produced new norms for the Standard Progressive Matrices,