Ensembles for clinical entity extraction

  1. Rebecka Weegar
  2. Alicia Pérez
  3. Hercules Dalianis
  4. Koldo Gojenola
  5. Arantza Casillas
  6. Maite Oronoz
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2018

Issue: 60

Pages: 13-20

Type: Article

Abstract

Health records are a valuable source of clinical knowledge. Natural Language Processing techniques have been applied to health records for a variety of applications. A common first step is medical entity detection: identifying drugs, diseases, and body parts. However, most of this work has been developed for records written in English. The aim of this work is to improve medical entity recognition for languages other than English by comparing the same methods on two languages and by using ensembles of models. The models were built for Spanish and Swedish health records using SVM, Perceptron, and CRF with four different feature sets, including unsupervised features. The ensemble applies weighted voting that takes each individual model's F-measure into account. In conclusion, the ensemble improves overall performance; further gains will require investigating more sophisticated ensemble methods.
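As an illustration of F-measure-weighted voting, the following is a minimal Python sketch, not the authors' implementation. This record does not give the exact voting scheme, so the sketch assumes that each model casts one vote per token, that the vote counts with a weight equal to that model's F-measure, and that the label with the largest total wins. The function name `weighted_vote`, the label names, and the F-measure values are all hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-token labels from several models by weighted voting.

    predictions: dict mapping model name -> list of labels, one per token.
    weights: dict mapping model name -> weight (assumed here to be the
             model's F-measure on held-out data).
    Returns one combined label per token.
    """
    lengths = {len(labels) for labels in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all models must label the same number of tokens")
    n_tokens = lengths.pop()

    combined = []
    for i in range(n_tokens):
        scores = defaultdict(float)
        for model, labels in predictions.items():
            # Each model adds its weight to the label it predicted.
            scores[labels[i]] += weights[model]
        # Pick the label with the highest accumulated weight.
        combined.append(max(scores, key=scores.get))
    return combined


# Hypothetical labels for a four-token sentence.
predictions = {
    "svm":        ["O", "Disease", "O",       "Drug"],
    "perceptron": ["O", "O",       "O",       "Drug"],
    "crf":        ["O", "Disease", "Disease", "O"],
}
# Hypothetical F-measures, used as voting weights.
weights = {"svm": 0.70, "perceptron": 0.65, "crf": 0.75}

ensemble_labels = weighted_vote(predictions, weights)
print(ensemble_labels)
```

With these weights, the second token is labelled `Disease` because the two models that agree outweigh the third. A single model only overrides the others when its weight exceeds their combined weight.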

Funding information

This work has been partially funded by the Spanish ministry (PROSAMED: TIN2016-77820-C3-1-R, TADEEP: TIN2015-70214-P), the Basque Government (DETEAMI: 2014111003), the University of the Basque Country UPV-EHU (MOV17/14) and the Nordic Center of Excellence in Health-Related e-Sciences (NIASC).

References

  • Agerri, R. and G. Rigau. 2016. Robust multilingual named entity recognition with shallow semi-supervised features. Artificial Intelligence, 238:63–82.
  • Brown, P. F., P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai. 1992. Class-based n-gram models of natural language. Computational linguistics, 18(4):467–479.
  • Carreras, X., L. Márquez, and L. Padró. 2003. Learning a perceptron-based named entity chunker via online recognition feedback. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4, pages 156–159. Association for Computational Linguistics.
  • Crammer, K., M. Dredze, K. Ganchev, P. P. Talukdar, and S. Carroll. 2007. Automatic code assignment to medical text. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, pages 129–136. Association for Computational Linguistics.
  • Dalianis, H., A. Henriksson, M. Kvist, S. Velupillai, and R. Weegar. 2015. HEALTH BANK–A Workbench for Data Science Applications in Healthcare. In Proceedings of the CAiSE-2015 Industry Track co-located with 27th Conference on Advanced Information Systems Engineering (CAiSE 2015), J. Krogstie, G. JuelSkielse and V. Kabilan, (Eds.), Stockholm, Sweden, June 11, 2015, CEUR, Vol-1381, pages 1–18.
  • Demner-Fushman, D., W. W. Chapman, and C. J. McDonald. 2009. What can natural language processing do for clinical decision support? Journal of biomedical informatics, 42(5):760–772.
  • Ehrentraut, C., H. Tanushi, H. Dalianis, and J. Tiedemann. 2012. Detection of Hospital Acquired Infections in sparse and noisy Swedish patient records. In Proceedings of the Sixth Workshop on Analytics for Noisy Unstructured Text Data.
  • Faruqui, M. and S. Padó. 2010. Training and Evaluating a German Named Entity Recognizer with Semantic Generalization. In KONVENS, pages 129–133.
  • Florian, R., A. Ittycheriah, H. Jing, and T. Zhang. 2003. Named entity recognition through classifier combination. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4, pages 168–171. Association for Computational Linguistics.
  • Haas, J. P., E. A. Mendonça, B. Ross, C. Friedman, and E. Larson. 2005. Use of computerized surveillance to detect nosocomial pneumonia in neonatal intensive care unit patients. American journal of infection control, 33(8):439–443.
  • Henriksson, A. 2015. Ensembles of semantic spaces: On combining models of distributional semantics with applications in healthcare. Ph.D. thesis, Department of Computer and Systems Sciences, Stockholm University.
  • Henriksson, A., M. Kvist, H. Dalianis, and M. Duneld. 2015. Identifying adverse drug event information in clinical notes with distributional semantic representations of context. Journal of Biomedical Informatics, 57:333–349.
  • Kang, N., Z. Afzal, B. Singh, E. M. Van Mulligen, and J. A. Kors. 2012. Using an ensemble system to improve concept extraction from clinical records. Journal of biomedical informatics, 45(3):423–428.
  • Lafferty, J. D., A. McCallum, and F. C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
  • Lai, K. H., M. Topaz, F. R. Goss, and L. Zhou. 2015. Automated misspelling detection and correction in clinical free-text records. Journal of Biomedical Informatics, 55(Supplement C):188–195.
  • Liang, P. 2005. Semi-Supervised Learning for Natural Language. Ph.D. thesis, Massachusetts Institute of Technology.
  • McNemar, Q. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157.
  • Murphy, K. P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
  • Oronoz, M., A. Casillas, K. Gojenola, and A. Pérez. 2013. Automatic annotation of medical records in Spanish with disease, drug and substance names. In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. Springer, pages 536–543.
  • Östling, R. 2013. Stagger: An open-source part of speech tagger for Swedish. Northern European Journal of Language Technology, 3:1–18.
  • Pérez, A., K. Gojenola, A. Casillas, M. Oronoz, and A. Díaz de Ilarraza. 2015. Computer aided classification of diagnostic terms in Spanish. Expert Systems with Applications, 42:2949–2958.
  • Pérez, A., R. Weegar, A. Casillas, K. Gojenola, M. Oronoz, and H. Dalianis. 2017. Semi-supervised medical entity recognition: A study on Spanish and Swedish clinical corpora. Journal of Biomedical Informatics, 71:16–30.
  • Reynoso, G. A., A. D. March, C. M. Berra, R. P. Strobietto, M. Barani, M. Iubatti, M. P. Chiaradio, D. Serebrisky, A. Kahn, O. A. Vaccarezza, J. L. Leguiza, M. Ceitlin, D. A. Luna, F. G. B. de Quirós, M. I. Otegui, M. C. Puga, and M. Vallejos. 2000. Development of the Spanish Version of the Systematized Nomenclature of Medicine: Methodology and Main Issues. In Proceedings of the AMIA Symposium, pages 694–698.
  • Ruch, P., R. Baud, and A. Geissbühler. 2003. Using lexical disambiguation and named-entity recognition to improve spelling correction in the electronic patient record. Artificial intelligence in medicine, 29(1):169–184.
  • Saha, S. and A. Ekbal. 2013. Combining multiple classifiers using vote based classifier ensemble technique for named entity recognition. Data & Knowledge Engineering, 85:15–39.
  • Skeppstedt, M., M. Kvist, G. Nilsson, and H. Dalianis. 2014. Automatic recognition of disorders, findings, pharmaceuticals and body structures from clinical text: An annotation and machine learning study. Journal of Biomedical Informatics, 49:148–158.
  • Tang, B., H. Cao, X. Wang, Q. Chen, and H. Xu. 2014. Evaluating word representation features in biomedical named entity recognition tasks. BioMed research international, 2014.
  • Turian, J., L. Ratinov, and Y. Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics, pages 384–394. Association for Computational Linguistics.
  • Zea, J. L. C., J. E. O. Luna, C. Thorne, and G. Glavaš. 2016. Spanish NER with Word Representations and Conditional Random Fields. In Proceedings of the Sixth Named Entity Workshop, pages 34–40.