Spelling Normalisation of Basque Historical Texts

  1. Soraluze, Ander
  2. Padilla, Manuel
  3. Estarrona Ibarloza, Ainara
  4. Etxeberria Uztarroz, Izaskun
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2019

Issue: 63

Pages: 59-66

Type: Article

Abstract

This article presents and evaluates, in a real-world setting, a computational method for normalising historical Basque texts so that, once normalised, they can be analysed with standard Natural Language Processing (NLP) tools. This normalisation work is part of a broader ongoing project, Basque in the Making (BIM): A Historical Look at a European Language Isolate, whose main goal is the systematic, diachronic study of certain grammatical features of the Basque language.

References

  • Alegria, I., O. Arregi, N. Ezeiza, and I. Fernández. 2006. Lessons from the development of a named entity recognizer for Basque. Procesamiento del Lenguaje Natural, 36.
  • Alegria, I., X. Artola, K. Sarasola, and M. Urkia. 1996. Automatic morphological analysis of Basque. Literary and Linguistic Computing, 11(4):193–203.
  • Alegria, I., I. Etxeberria, M. Hulden, and M. Maritxalar. 2009. Porting Basque morphological grammars to foma, an open-source tool. In Finite-State Methods and Natural Language Processing. Springer, pages 105–113.
  • Bollmann, M., S. Dipper, J. Krasselt, and F. Petran. 2012. Manual and semi-automatic normalization of historical spelling - case studies from Early New High German. In KONVENS, pages 342–350.
  • Bollmann, M. and A. Søgaard. 2016. Improving historical spelling normalization with bi-directional LSTMs and multi-task learning. arXiv preprint arXiv:1610.07844.
  • Etxeberria, I. 2016. Aldaera linguistikoen normalizazioa inferentzia fonologikoa eta morfologikoa erabiliz. Ph.D. thesis, Universidad del País Vasco / Euskal Herriko Unibertsitatea.
  • Etxeberria, I., I. Alegria, M. Hulden, and L. Uria. 2014. Learning to map variation-standard forms using a limited parallel corpus and the standard morphology. Procesamiento del Lenguaje Natural, 52:13–20.
  • Etxeberria, I., I. Alegria, and L. Uria. 2019. Weighted finite-state transducers for normalization of historical texts. Natural Language Engineering, 25(2):307–321.
  • Etxeberria, I., I. Alegria, L. Uria, and M. Hulden. 2016. Evaluating the Noisy Channel Model for the Normalization of Historical Texts: Basque, Spanish and Slovene. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 1064–1069.
  • Jurish, B. 2010. Comparing canonicalizations of historical German text. In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pages 72–77.
  • Kestemont, M., W. Daelemans, and G. D. Pauw. 2010. Weigh your words - memory-based lemmatization for Middle Dutch. Literary and Linguistic Computing, 25(3):287–301.
  • Moeller, S., G. Kazeminejad, A. Cowell, and M. Hulden. 2019. Improving low-resource morphological learning with intermediate forms from finite state transducers. In Proceedings of the Workshop on Computational Methods for Endangered Languages, volume 1, pages 81–86.
  • Novak, J. R., N. Minematsu, and K. Hirose. 2012. WFST-Based Grapheme-to-Phoneme Conversion: Open Source tools for Alignment, Model-Building and Decoding. In Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing, pages 45–49.
  • Pettersson, E., B. Megyesi, and J. Nivre. 2014. A multilingual evaluation of three spelling normalisation methods for historical text. Proceedings of LaTeCH, pages 32–41.
  • Pettersson, E., B. Megyesi, and J. Tiedemann. 2013. An SMT approach to automatic annotation of historical text. In Proceedings of the Workshop on Computational Historical Linguistics at NODALIDA 2013, NEALT Proceedings Series, volume 18, pages 54–69.
  • Scherrer, Y. and T. Erjavec. 2016. Modernising historical Slovene words. Natural Language Engineering, 22(6):881–905.
  • Tang, G., F. Cap, E. Pettersson, and J. Nivre. 2018. An evaluation of neural machine translation models on historical spelling normalization. arXiv preprint arXiv:1806.05210.
  • Uria, L. and R. Etxepare. 2011. Basyque: Aplicación para el estudio de la variación sintáctica. Linguamática, 3(1):35–44.