Spelling Normalisation of Basque Historical Texts

  1. Soraluze, Ander
  2. Padilla, Manuel
  3. Estarrona Ibarloza, Ainara
  4. Etxeberria Uztarroz, Izaskun
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2019

Issue: 63

Pages: 59-66

Type: Article

Abstract

This paper presents a computational method for normalising the spelling of Basque historical texts so that they can be analysed with standard Natural Language Processing (NLP) tools, together with an evaluation of the method in a real scenario. The normalisation work is part of a broader ongoing project, Basque in the Making (BIM): A Historical Look at a European Language Isolate, whose main objective is the systematic, diachronic study of a number of grammatical features of the Basque language.

Bibliographic References

  • Alegria, I., O. Arregi, N. Ezeiza, and I. Fernández. 2006. Lessons from the development of a named entity recognizer for Basque. Procesamiento del Lenguaje Natural, 36.
  • Alegria, I., X. Artola, K. Sarasola, and M. Urkia. 1996. Automatic morphological analysis of Basque. Literary and Linguistic Computing, 11(4):193–203.
  • Alegria, I., I. Etxeberria, M. Hulden, and M. Maritxalar. 2009. Porting Basque morphological grammars to foma, an open-source tool. In Finite-State Methods and Natural Language Processing. Springer, pages 105–113.
  • Bollmann, M., S. Dipper, J. Krasselt, and F. Petran. 2012. Manual and semi-automatic normalization of historical spelling - case studies from Early New High German. In KONVENS, pages 342–350.
  • Bollmann, M. and A. Søgaard. 2016. Improving historical spelling normalization with bi-directional LSTMs and multi-task learning. arXiv preprint arXiv:1610.07844.
  • Etxeberria, I. 2016. Aldaera linguistikoen normalizazioa inferentzia fonologikoa eta morfologikoa erabiliz. Ph.D. thesis, Universidad del País Vasco / Euskal Herriko Unibertsitatea.
  • Etxeberria, I., I. Alegria, M. Hulden, and L. Uria. 2014. Learning to map variation-standard forms using a limited parallel corpus and the standard morphology. Procesamiento del Lenguaje Natural, 52:13–20.
  • Etxeberria, I., I. Alegria, and L. Uria. 2019. Weighted finite-state transducers for normalization of historical texts. Natural Language Engineering, 25(2):307–321.
  • Etxeberria, I., I. Alegria, L. Uria, and M. Hulden. 2016. Evaluating the Noisy Channel Model for the Normalization of Historical Texts: Basque, Spanish and Slovene. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 1064–1069.
  • Jurish, B. 2010. Comparing canonicalizations of historical German text. In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pages 72–77.
  • Kestemont, M., W. Daelemans, and G. D. Pauw. 2010. Weigh your words— memory-based lemmatization for Middle Dutch. Literary and Linguistic Computing, 25(3):287–301.
  • Moeller, S., G. Kazeminejad, A. Cowell, and M. Hulden. 2019. Improving low-resource morphological learning with intermediate forms from finite state transducers. In Proceedings of the Workshop on Computational Methods for Endangered Languages, volume 1, pages 81–86.
  • Novak, J. R., N. Minematsu, and K. Hirose. 2012. WFST-Based Grapheme-to-Phoneme Conversion: Open Source tools for Alignment, Model-Building and Decoding. In Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing, pages 45–49.
  • Pettersson, E., B. Megyesi, and J. Nivre. 2014. A multilingual evaluation of three spelling normalisation methods for historical text. Proceedings of LaTeCH, pages 32–41.
  • Pettersson, E., B. Megyesi, and J. Tiedemann. 2013. An SMT approach to automatic annotation of historical text. In Proceedings of the Workshop on Computational Historical Linguistics at NODALIDA 2013, NEALT Proceedings Series, volume 18, pages 54–69.
  • Scherrer, Y. and T. Erjavec. 2016. Modernising historical Slovene words. Natural Language Engineering, 22(6):881–905.
  • Tang, G., F. Cap, E. Pettersson, and J. Nivre. 2018. An evaluation of neural machine translation models on historical spelling normalization. arXiv preprint arXiv:1806.05210.
  • Uria, L. and R. Etxepare. 2011. Basyque: Aplicación para el estudio de la variación sintáctica. Linguamática, 3(1):35–44.