Automatic regrouping of strata in the chi-square test

Juan Manuel Pérez-Salamero González ¹
Marta Regúlez-Castillo ²
Manuel Ventura-Marco ¹
Carlos Vidal-Meliá ¹³

1 Universitat de València, Valencia, Spain. ROR https://ror.org/043nxc105

2 Universidad del País Vasco/Euskal Herriko Unibertsitatea, Lejona, Spain. ROR https://ror.org/000xsnr85

3 Universidad Complutense de Madrid, Madrid, Spain. ROR 02p0gd045

Journal: Documentos de Trabajo (ICAE)

ISSN: 2341-2356

Year of publication: 2017

Issue: 24

Pages: 1-25

Type: Working paper

Abstract

Pearson's chi-square test is widely used in the social and health sciences to analyze categorical data and contingency tables and to assess sample representativeness. For the test to be valid, the sample must be large enough to provide a minimum expected number of elements in each category. If the researcher chooses to regroup the strata to remedy a failure to meet this minimum-size requirement, automatic regrouping procedures in statistical software would be very useful, especially when tests are applied sequentially. After comprehensively reviewing the software that can carry out this test, we find that, with a few exceptions, none regroups the strata automatically to meet this requirement. This paper develops functions that regroup strata automatically wherever they are located, thus enabling the test to be performed within an iterative procedure. The functions are written in Excel VBA (Visual Basic for Applications) and in Mathematica, so it would not be hard to implement them in other languages. The utility of these functions is shown using three different datasets. Finally, the functions are applied iteratively to the Continuous Sample of Working Lives, a dataset that has been used in a considerable number of studies, especially on labor economics and the Spanish public pension system.
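To make the idea of automatic regrouping concrete, the following is a minimal sketch in Python, not the authors' Excel VBA or Mathematica functions. It assumes a one-dimensional goodness-of-fit setting with ordered strata, a minimum expected count of 5, and a merge rule chosen purely for illustration: the stratum with the smallest expected count is merged with its adjacent neighbor that has the smaller expected count. The names `regroup_strata` and `chi_square_statistic` are hypothetical.

```python
from typing import List, Sequence, Tuple


def regroup_strata(
    observed: Sequence[float],
    expected: Sequence[float],
    min_expected: float = 5.0,
) -> Tuple[List[float], List[float], List[List[int]]]:
    """Merge adjacent strata until every expected count is >= min_expected.

    Returns the regrouped observed counts, the regrouped expected counts,
    and, for each new stratum, the indices of the original strata it contains.
    The merge rule is an illustrative assumption, not the paper's procedure.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    obs = list(observed)
    exp = list(expected)
    groups = [[i] for i in range(len(obs))]

    while len(exp) > 1 and min(exp) < min_expected:
        i = exp.index(min(exp))  # stratum with the smallest expected count
        if i == 0:
            j = 1                # first stratum: only a right neighbor exists
        elif i == len(exp) - 1:
            j = i - 1            # last stratum: only a left neighbor exists
        else:
            # interior stratum: pick the neighbor with the smaller expected count
            j = i - 1 if exp[i - 1] <= exp[i + 1] else i + 1
        lo, hi = min(i, j), max(i, j)
        obs[lo:hi + 1] = [obs[lo] + obs[hi]]
        exp[lo:hi + 1] = [exp[lo] + exp[hi]]
        groups[lo:hi + 1] = [groups[lo] + groups[hi]]

    return obs, exp, groups


def chi_square_statistic(observed: Sequence[float], expected: Sequence[float]) -> float:
    """Pearson's statistic: sum over strata of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up counts for illustration: the first and last strata fall below 5.
new_obs, new_exp, new_groups = regroup_strata(
    [2, 30, 45, 20, 3], [3.0, 32.0, 40.0, 21.0, 4.0]
)
print(new_groups)                      # which original strata were merged
print(chi_square_statistic(new_obs, new_exp))
print(len(new_obs) - 1)                # degrees of freedom after regrouping
```

Because the loop only inspects the current expected counts, it handles sparse strata at either end or in the middle of the table, and it can be called repeatedly inside a sequence of tests. Other merge rules are equally defensible, and the degrees of freedom must always be computed from the regrouped table.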

Funding information

We gratefully acknowledge financial support from the Ministerio de Economía y Competitividad (Spain) and from the Basque Government via projects ECO2015-65826-P and IT 793-13, respectively.

Funders

Ministerio de Economía y Competitividad (Spain)
- ECO2015-65826-P
Basque Government (Spain)
- IT 793-13

Data source: Dialnet
