The aid of machine learning to overcome the classification of real health discharge reports written in Spanish

  1. Alicia Pérez
  2. Arantza Casillas
  3. Koldo Gojenola
  4. Maite Oronoz
  5. Nerea Aguirre
  6. Estibaliz Amillano
Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2014

Issue: 53

Pages: 77-84

Type: Article


Abstract

Hospitals attached to the Spanish Ministry of Health currently use the International Classification of Diseases, 9th Revision, Clinical Modification (ICD9-CM) to classify health discharge records. At present, this coding is done manually by experts. This paper tackles the automatic classification of real discharge records written in Spanish according to the ICD9-CM standard. The challenge is that the records are written in spontaneous language. We explore several machine learning techniques for this classification problem. Random Forest proved the most competitive, achieving an F-measure of 0.876.
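For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F-measure can be computed for a classifier that assigns codes to records. It is not the authors' evaluation code. The abstract does not say how per-class scores were averaged, so the support-weighted average used here is an assumption, as is the single-label setting (one code per record). The function names and the toy codes are hypothetical and chosen only for illustration.

```python
from collections import Counter


def per_class_f_measure(gold, predicted):
    """Return {label: (precision, recall, f1)} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was the correct label but was missed
    scores = {}
    for label in set(gold) | set(predicted):
        pred_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        precision = tp[label] / pred_count if pred_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def weighted_f_measure(gold, predicted):
    """Average per-class F1, weighting each class by its frequency in gold.

    Assumption: support-weighted averaging; the source does not state
    which averaging scheme produced the reported figure.
    """
    scores = per_class_f_measure(gold, predicted)
    support = Counter(gold)
    return sum(scores[label][2] * n for label, n in support.items()) / len(gold)


# Toy data for illustration only (not taken from the paper).
gold = ["401.9", "401.9", "250.00", "486", "486", "486"]
predicted = ["401.9", "250.00", "250.00", "486", "486", "401.9"]

if __name__ == "__main__":
    print(round(weighted_f_measure(gold, predicted), 3))
```

The F-measure is the harmonic mean of precision and recall, so a classifier scores well only when it both assigns a code correctly and rarely misses records that carry that code.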
