Automatic regrouping of strata in the chi-square test

  1. Juan Manuel Pérez-Salamero González 1
  2. Marta Regúlez-Castillo 2
  3. Manuel Ventura-Marco 1
  4. Carlos Vidal-Meliá 1, 3
  1. 1 Universitat de València, Valencia, Spain
  2. 2 Universidad del País Vasco/Euskal Herriko Unibertsitatea, Lejona, Spain
  3. 3 Universidad Complutense de Madrid, Madrid, Spain (ROR 02p0gd045)

Documentos de Trabajo (ICAE)

ISSN: 2341-2356

Year of publication: 2017

Issue: 24

Pages: 1-25

Type: Working paper


Pearson's chi-square test is widely used in the social and health sciences to analyze categorical data and contingency tables and to assess sample representativeness. For the test to be valid, the sample must be large enough to provide a minimum number of expected elements per category. When this minimum-size requirement is not met, the researcher may choose to regroup the strata, and automatic regrouping procedures in statistical software would be very useful, especially when tests are applied sequentially. After comprehensively reviewing the software that can carry out this test, we find that, with a few exceptions, none regroups the strata automatically to meet this requirement. This paper develops functions that regroup strata automatically, no matter where they are located, thus enabling the test to be performed within an iterative procedure. The functions are written in Excel VBA (Visual Basic for Applications) and in Mathematica, so it would not be hard to implement them in other languages. Their utility is shown using three different datasets. Finally, the functions are applied iteratively to the Continuous Sample of Working Lives, a dataset that has been used in a considerable number of studies, especially on labor economics and the Spanish public pension system.
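
As a rough illustration of the kind of regrouping the abstract describes, the following Python sketch merges strata whose expected count falls below a threshold and then computes Pearson's statistic on the regrouped data. It is not the authors' VBA or Mathematica code: the function names `regroup_strata` and `chi_square_statistic`, the threshold of 5, the sample figures, and the rule of merging the smallest stratum with its smaller adjacent neighbour are all assumptions made for this example.

```python
def regroup_strata(observed, expected, min_expected=5.0):
    """Merge adjacent strata until every group has an expected count
    of at least min_expected (or only one group is left).

    Hypothetical helper: the merging rule (smallest group joins its
    smaller adjacent neighbour) is an assumption, not the paper's rule.
    Returns (groups, observed_sums, expected_sums), where groups lists
    the original stratum indices contained in each merged group.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")

    groups = [[i] for i in range(len(observed))]
    obs = [float(o) for o in observed]
    exp = [float(e) for e in expected]

    while len(groups) > 1:
        # Locate the group with the smallest expected count.
        k = min(range(len(exp)), key=exp.__getitem__)
        if exp[k] >= min_expected:
            break  # every group already meets the requirement

        # Choose the adjacent neighbour with the smaller expected count.
        if k == 0:
            j = 1
        elif k == len(exp) - 1:
            j = k - 1
        else:
            j = k - 1 if exp[k - 1] <= exp[k + 1] else k + 1

        # Replace the two adjacent groups with their union.
        lo, hi = min(j, k), max(j, k)
        groups[lo:hi + 1] = [groups[lo] + groups[hi]]
        obs[lo:hi + 1] = [obs[lo] + obs[hi]]
        exp[lo:hi + 1] = [exp[lo] + exp[hi]]

    return groups, obs, exp


def chi_square_statistic(observed, expected):
    """Pearson's statistic: sum of (O - E)^2 / E over all groups."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up counts: strata 1 and 2 have expected counts below 5.
observed = [30, 3, 2, 40, 25]
expected = [28.0, 4.0, 3.0, 38.0, 27.0]

groups, obs, exp = regroup_strata(observed, expected)
statistic = chi_square_statistic(obs, exp)
degrees_of_freedom = len(groups) - 1
print(groups, round(statistic, 4), degrees_of_freedom)
```

In this example the two small strata are merged into one group, so the test is run on four groups instead of five and the degrees of freedom drop accordingly. Because the function returns the index groups, the same regrouping can be reused when the test is repeated inside an iterative procedure.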

Funding information

We gratefully acknowledge financial support from the Ministerio de Economía y Competitividad (Spain) and the Basque Government via projects ECO2015-65826-P and IT 793-13, respectively.

