Development and Validation of a Training-Embedded Speaking Assessment Rating Scale: A Multifaceted Rasch Analysis in Speaking Assessment

Bijani, Houman; Hashempour, Bahareh; Said Bani Orabah, Salim

doi:10.52547/ijree.7.3.32

Volume 7, Issue 3 (9-2022) IJREE 2022, 7(3): 32-45

20.1001.1.25384015.2022.7.3.2.5

Houman Bijani*, Bahareh Hashempour, Salim Said Bani Orabah

Islamic Azad University, Zanjan Branch, Zanjan, Iran

Abstract:

Performance testing that relies on rating scales has become widespread in second/foreign language oral assessment. However, no single study has applied Multifaceted Rasch Measurement (MFRM) to the facets of test takers’ ability, raters’ severity, group expertise, and scale category together. Twenty EFL teachers scored the speaking performance of 200 test takers before and after a rater training program, using an analytic rating scale comprising fluency, grammar, vocabulary, intelligibility, cohesion, and comprehension categories. The results showed that the categories differed in difficulty even after the training program. This does not mean the training was ineffective: the analysis showed that training brought raters to sufficient consistency in rating each category of the scale at the post-training phase. This indicated that raters could discriminate among the categories of the rating scale. The results also indicated that MFRM can enhance rater training and help validate how the rating scale descriptors function. Training helped raters use the scale's band descriptors more efficiently, which reduced the halo effect. The findings suggest that stakeholders should establish training programs that help raters make appropriate use of rating scale categories of differing difficulty. Further research could compare the outcomes of this study with those obtained using a holistic rating scale in oral assessment.
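
To make the facets named in the abstract concrete, the following is a minimal sketch of how a many-facet rating scale model turns facet measures into score probabilities. It assumes the common textbook formulation in which the log-odds of adjacent score categories equal ability minus rater severity, minus a group effect, minus scale-category difficulty, minus a step threshold. The abstract does not give the authors' exact model specification, so the function names, parameter names, and all numeric values here are hypothetical illustrations, not the study's estimates.

```python
import math


def mfrm_category_probs(ability, rater_severity, group_effect,
                        criterion_difficulty, thresholds):
    """Return P(score = k) for k = 0..len(thresholds).

    Assumed model (not taken from the paper): all facets are on one
    logit scale and combine additively.
    """
    # Net location of the test taker after subtracting the other facets.
    eta = ability - rater_severity - group_effect - criterion_difficulty

    # Unnormalised log-probabilities: score 0 is fixed at 0, and each
    # higher score adds (eta - step threshold).
    log_numerators = [0.0]
    for tau in thresholds:
        log_numerators.append(log_numerators[-1] + (eta - tau))

    # Normalise, subtracting the maximum first for numerical stability.
    peak = max(log_numerators)
    weights = [math.exp(v - peak) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(probs):
    """Expected rating implied by a list of category probabilities."""
    return sum(k * p for k, p in enumerate(probs))


# Hypothetical values: same test taker and criterion, two raters who
# differ only in severity (in logits).
steps = [-1.0, 0.0, 1.0]
lenient = mfrm_category_probs(1.0, -0.5, 0.0, 0.2, steps)
severe = mfrm_category_probs(1.0, 0.5, 0.0, 0.2, steps)
print(expected_score(lenient), expected_score(severe))
```

Under this sketch, a more severe rater or a more difficult scale category lowers the expected rating for the same test taker, which is the kind of effect the MFRM analysis separates from ability.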

Keywords: Bias, Interrater consistency, Intrarater consistency, Multifaceted Rasch Measurement (MFRM), Rater training, Rating scale

Type of Study: Research | Subject: Special

Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.