The argument-based validation research reported in this chapter was conducted from the perspective of an outside evaluator with concerns about the consistency of scores on the Telephone Standard Speaking Test (TSST), a telephone-based test of second language (L2) English speaking proficiency used to assess improvement in speaking proficiency over time. This use of the test requires that the warrant for generalization be plausible: that observed scores are estimates of expected scores, which are consistent across test tasks, forms, occasions, and raters. To guide the investigation, a rebuttal was formulated: that observed scores fail to estimate expected scores because of error introduced in the testing process. The research investigated two of the rebuttal's assumptions. TSST scores collected twice within a month from 55 undergraduates at two Japanese universities indicated that the test forms had equivalent means and SDs and that each participant's two scores were highly correlated. For one-third of the participants, however, the two scores differed by one score level. The results thus provided partial support for one of the assumptions underlying the rebuttal. The chapter concludes by highlighting the important role rebuttals play in bringing threats of concern to test users into an interpretation/use argument.
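To illustrate the kind of consistency check described above, the following minimal Python sketch compares means and SDs across two testing occasions, computes the test-retest correlation, and tallies how many paired scores differ by one score level. The score arrays and the level scale are hypothetical stand-ins for illustration only, not the actual TSST data.

import numpy as np
from scipy import stats

# Hypothetical paired scores for the same participants on two
# occasions about a month apart (invented values, not TSST data).
occasion1 = np.array([4, 5, 5, 6, 4, 5, 6, 5, 4, 5])
occasion2 = np.array([4, 5, 6, 6, 4, 4, 6, 5, 5, 5])

# Compare occasion/form means and SDs.
print(f"Means: {occasion1.mean():.2f} vs {occasion2.mean():.2f}")
print(f"SDs:   {occasion1.std(ddof=1):.2f} vs {occasion2.std(ddof=1):.2f}")

# Test-retest correlation between the paired scores.
r, p = stats.pearsonr(occasion1, occasion2)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")

# Proportion of participants whose two scores differ by one level.
diff_one = np.mean(np.abs(occasion1 - occasion2) == 1)
print(f"Scores differing by one level: {diff_one:.0%}")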
This argument-based validation research investigates the validity of score interpretations on a computer-based, graphic-prompt writing test, focusing on the generalization inference. The graphic-prompt writing test assesses examinees' ability to incorporate visual graphic information into their writing. Both analytic ratings on Graph Description, Content Development, Organization, and Grammar/Vocabulary (n = 2,424) and composite ratings (n = 606) on written test responses from 101 ESL students were analyzed using Generalizability (G) Theory and Multi-Faceted Rasch Measurement (MFRM). Findings indicated that three of the four analytic scales, as well as the composites, yielded dependable scores. In addition, the G-studies and MFRM analyses revealed that the relative effect of the raters on total score variance was not trivial for either the composite or the analytic scores, and that the three raters were not equivalent in rating severity. Nevertheless, the findings support the generalization inference to a large extent. Thus, it can be claimed that scores on the graphic-prompt writing tasks were dependable enough to be used for the intended purposes, particularly under the two-rater, three-task administration design.
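As a rough illustration of the G-study computation involved, the Python sketch below estimates variance components for a fully crossed persons x raters design and projects a Phi (dependability) coefficient for a two-rater decision study. The ratings are invented, and the single-facet crossed design is a simplification of the chapter's persons x raters x tasks design; none of the numbers correspond to the reported results.

import numpy as np

# Hypothetical ratings: rows = persons, columns = three raters
# (fully crossed p x r design; invented values for illustration).
X = np.array([
    [3, 4, 3],
    [5, 5, 4],
    [4, 4, 4],
    [2, 3, 2],
    [5, 4, 5],
    [3, 3, 4],
], dtype=float)
n_p, n_r = X.shape

grand = X.mean()
p_means = X.mean(axis=1)  # person means
r_means = X.mean(axis=0)  # rater means

# Mean squares from a two-way ANOVA without replication.
ss_p = n_r * np.sum((p_means - grand) ** 2)
ss_r = n_p * np.sum((r_means - grand) ** 2)
ss_pr = np.sum((X - p_means[:, None] - r_means[None, :] + grand) ** 2)
ms_p = ss_p / (n_p - 1)
ms_r = ss_r / (n_r - 1)
ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

# Estimated variance components (negative estimates set to zero).
var_pr = ms_pr
var_p = max((ms_p - ms_pr) / n_r, 0.0)
var_r = max((ms_r - ms_pr) / n_p, 0.0)

# Phi (dependability) coefficient projected for a k-rater design.
k = 2
phi = var_p / (var_p + (var_r + var_pr) / k)
print(f"var(person)={var_p:.3f}, var(rater)={var_r:.3f}, var(residual)={var_pr:.3f}")
print(f"Phi with {k} raters = {phi:.3f}")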