Psychological and Educational Test Score Comparability across Groups in the Presence of Item Bias

  1. Paula Elosua 1
  2. Ronald K. Hambleton 2
  1. 1 Universidad del País Vasco
  2. 2 University of Massachusetts System
    University of Massachusetts System, Boston, United States

    ROR https://ror.org/0260j1g46

Journal:
Revista de Psicología y Educación

ISSN: 1699-9517

Year of publication: 2018

Volume: 13

Issue: 1

Pages: 23-32

Type: Article


Abstract

Test publishers commonly make their most popular educational and psychological tests available in multiple languages and cultures. Occasionally, after a new language version has been published, items are found that may disadvantage examinees taking the translated test because of bias. When such a test is used, candidates' scores will be underestimated to some extent and test score validity will be adversely affected. The purpose of this paper is to introduce and demonstrate one possible technical solution to the problem, which combines differential item functioning (DIF) analysis with statistical equating of test scores. The solution involves two steps. First, any DIF items must be identified in the language or cultural group of interest using one or more of the standard DIF detection procedures. Second, after the DIF items that may be biasing score interpretations are removed from test scoring, a statistical equating can be carried out between the original test and the reduced (shortened) test in the second language or cultural group. The paper demonstrates the methodology and discusses the advantages and disadvantages of the solution.
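The two steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract names neither the DIF procedure nor the equating method, so the sketch assumes Mantel-Haenszel detection for dichotomous items, a simplified flagging rule (absolute value of the ETS delta of at least 1.5, with no significance test), and linear equating under a randomly-equivalent-groups assumption. All function names and the simulated data are hypothetical.

```python
import numpy as np


def mantel_haenszel_odds_ratio(item, group, matching_score):
    """Mantel-Haenszel common odds ratio for one dichotomous (0/1) item.

    group: 0 = reference group, 1 = focal group.
    Values above 1 mean the item favors the reference group.
    """
    item = np.asarray(item)
    group = np.asarray(group)
    matching_score = np.asarray(matching_score)
    num = den = 0.0
    for s in np.unique(matching_score):
        in_stratum = matching_score == s
        ref = in_stratum & (group == 0)
        foc = in_stratum & (group == 1)
        a = np.sum(item[ref] == 1)  # reference, correct
        b = np.sum(item[ref] == 0)  # reference, incorrect
        c = np.sum(item[foc] == 1)  # focal, correct
        d = np.sum(item[foc] == 0)  # focal, incorrect
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den if den > 0 else float("nan")


def flag_dif_items(responses, group, threshold=1.5):
    """Step 1: indices of items whose |MH D-DIF| reaches the threshold."""
    responses = np.asarray(responses)
    total = responses.sum(axis=1)  # matching variable: total test score
    flagged = []
    for j in range(responses.shape[1]):
        alpha = mantel_haenszel_odds_ratio(responses[:, j], group, total)
        delta = -2.35 * np.log(alpha)  # ETS delta metric
        if abs(delta) >= threshold:  # simplified rule (assumption)
            flagged.append(j)
    return flagged


def linear_equate(x, from_scores, to_scores):
    """Step 2: place score x from one form on the scale of another form."""
    from_scores = np.asarray(from_scores, dtype=float)
    to_scores = np.asarray(to_scores, dtype=float)
    slope = to_scores.std(ddof=1) / from_scores.std(ddof=1)
    return to_scores.mean() + slope * (np.asarray(x, dtype=float) - from_scores.mean())


if __name__ == "__main__":
    # Simulated Rasch-type data; item 3 is made harder for the focal group.
    rng = np.random.default_rng(0)
    n_per_group, n_items = 1000, 20
    group = np.repeat([0, 1], n_per_group)
    theta = rng.normal(size=2 * n_per_group)
    difficulty = np.linspace(-1.5, 1.5, n_items)
    shift = np.zeros(n_items)
    shift[3] = 1.0  # hypothetical biased item
    logits = theta[:, None] - difficulty[None, :] - np.outer(group, shift)
    responses = (rng.random(logits.shape) < 1 / (1 + np.exp(-logits))).astype(int)

    flagged = flag_dif_items(responses, group)
    keep = [j for j in range(n_items) if j not in flagged]
    full_reference = responses[group == 0].sum(axis=1)          # original test
    reduced_focal = responses[group == 1][:, keep].sum(axis=1)  # shortened test
    print("flagged items:", flagged)
    print("reduced score 10 on the original scale:",
          linear_equate(10, reduced_focal, full_reference))
```

Linear equating matches only the means and standard deviations of the two score distributions. An equipercentile or anchor-based design using the DIF-free items may be more appropriate when the two groups differ in ability, which is plausible across language or cultural groups.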
