1. Attali, Y. (2016). A comparison of newly-trained and experienced raters on a standardized writing assessment. Language Testing, 33(1), 99-115. [DOI:10.1177/0265532215582283]
2. Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge University Press. [DOI:10.1017/CBO9780511667350]
3. Barkaoui, K. (2011). Think-aloud protocols in research on essay rating: An empirical study on their veridicality and reactivity. Language Testing, 28(1), 51-75. [DOI:10.1177/0265532210376379]
4. Bijani, H. (2010). Raters' perception and expertise in evaluating second language compositions. The Journal of Applied Linguistics, 3(2), 69-89.
5. Bijani, H., & Fahim, M. (2011). The effects of rater training on raters' severity and bias analysis in second language writing. Iranian Journal of Language Testing, 1(1), 1-16.
6. Cohen, L., Manion, L., & Morrison, K. (2007). Research methods in education. London: Routledge. [DOI:10.4324/9780203029053]
7. Davis, L. (2016). The influence of training and experience on rater performance in scoring spoken language. Language Testing, 33(1), 117-135. [DOI:10.1177/0265532215582282]
8. Dörnyei, Z. (2007). Research methods in applied linguistics: Quantitative, qualitative and mixed methodologies. Oxford: Oxford University Press.
9. Eckes, T. (2015). Introduction to many-facet Rasch measurement. Frankfurt: Peter Lang Edition.
10. Fan, J., & Yan, X. (2020). Assessing speaking proficiency: A narrative review of speaking assessment research within the argument-based validation framework. Frontiers in Psychology, 11(1), 1-14. [DOI:10.3389/fpsyg.2020.00330]
11. Flake, J. K. (2021). Strengthening the foundation of educational psychology by integrating construct validation into open science reform. Educational Psychologist, 56(2), 132-141. [DOI:10.1080/00461520.2021.1898962]
12. Gan, Z. (2010). Interaction in group oral assessment: A case study of higher- and lower-scoring students. Language Testing, 27(4), 585-602. [DOI:10.1177/0265532210364049]
13. Huang, B. H., Bailey, A. L., Sass, D. A., & Shawn Chang, Y. (2020). An investigation of the validity of a speaking assessment for adolescent English language learners. Language Testing, 37(2), 1-28. [DOI:10.1177/0265532220925731]
14. Huang, H., Huang, S., & Hong, H. (2016). Test-taker characteristics and integrated speaking test performance: A path-analytic study. Language Assessment Quarterly, 13(4), 283-301. [DOI:10.1080/15434303.2016.1236111]
15. In'nami, Y., & Koizumi, R. (2016). Task and rater effects in L2 speaking and writing: A synthesis of generalizability studies. Language Testing, 33(3), 341-366. [DOI:10.1177/0265532215587390]
16. Khabbazbashi, N. (2017). Topic and background knowledge effects on performance in speaking assessment. Language Testing, 34(1), 23-48. [DOI:10.1177/0265532215595666]
17. Kim, H. J. (2011). Investigating raters' development of rating ability on a second language speaking assessment. Unpublished PhD thesis, Columbia University.
18. Kim, H. J. (2015). A qualitative analysis of rater behavior on an L2 speaking assessment. Language Assessment Quarterly, 12(3), 239-261. [DOI:10.1080/15434303.2015.1049353]
19. Knoch, U. (2009). Diagnostic assessment of writing: A comparison of two rating scales. Language Testing, 26(2), 275-304. [DOI:10.1177/0265532208101008]
20. Kuiken, F., & Vedder, I. (2014). Raters' decisions, rating procedures, and rating scales. Language Testing, 31(3), 279-284. [DOI:10.1177/0265532214526179]
21. Kyle, K., Crossley, S. A., & McNamara, D. S. (2016). Construct validity in TOEFL iBT speaking tasks: Insights from natural language processing. Language Testing, 33(3), 319-340. [DOI:10.1177/0265532215587391]
22. Linacre, J. M. (1989). Many-faceted Rasch measurement. Chicago, IL: MESA Press.
23. Luoma, S. (2004). Assessing speaking. Cambridge: Cambridge University Press. [DOI:10.1017/CBO9780511733017]
24. May, L. (2009). Co-constructed interaction in a paired speaking test: The rater's perspective. Language Testing, 26(3), 397-421. [DOI:10.1177/0265532209104668]
25. McNamara, T. F. (1996). Measuring second language performance. London: Longman.
26. McNamara, T. F., & Lumley, T. (1997). The effect of interlocutor and assessment mode variables in overseas assessments of speaking skills in occupational settings. Language Testing, 14(2), 140-156. [DOI:10.1177/026553229701400202]
27. Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement. Journal of Applied Measurement, 5(2), 189-227.
28. Nakatsuhara, F. (2011). Effect of test-taker characteristics and the number of participants in group oral tests. Language Testing, 28(4), 483-508. [DOI:10.1177/0265532211398110]
29. Sawaki, Y. (2007). Construct validation of analytic rating scales in a speaking assessment: Reporting a score profile and a composite. Language Testing, 24(3), 355-390. [DOI:10.1177/0265532207077205]
30. Schaefer, E. (2008). Rater bias patterns in an EFL writing assessment. Language Testing, 25(4), 465-493. [DOI:10.1177/0265532208094273]
31. Tarone, E. (1983). On the variability of interlanguage systems. Applied Linguistics, 4(2), 142-164. [DOI:10.1093/applin/4.2.142]
32. Tavakoli, P., Nakatsuhara, F., & Hunter, A. M. (2020). Aspects of fluency across assessed levels of speaking proficiency. The Modern Language Journal, 104(1), 169-191. [DOI:10.1111/modl.12620]
33. Theobald, M. (2021). Self-regulated learning training programs enhance university students' academic performance, self-regulated learning strategies, and motivation: A meta-analysis. Contemporary Educational Psychology, 66, 101976. [DOI:10.1016/j.cedpsych.2021.101976]
34. Trace, J., Janssen, G., & Meier, V. (2017). Measuring the impact of rater negotiation in writing performance assessment. Language Testing, 34(1), 3-22. [DOI:10.1177/0265532215594830]
35. Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press. [DOI:10.1017/CBO9780511732997]
36. Winke, P., & Gass, S. (2013). The influence of second language experience and accent familiarity on oral proficiency rating: A qualitative investigation. TESOL Quarterly, 47(4), 762-789. [DOI:10.1002/tesq.73]
37. Winke, P., Gass, S., & Myford, C. (2012). Raters' L2 background as a potential source of bias in rating oral performance. Language Testing, 30(2), 231-252. [DOI:10.1177/0265532212456968]
38. Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 369-386.