Errores ortográficos y de competencia en textos de la web en euskera (Spelling and competence errors in Basque web texts)

  1. Alegría Loinaz, Iñaki
  2. Etxeberria Uztarroz, Izaskun
  3. Leturia Azkarate, Igor
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2010

Issue: 45

Pages: 137-144

Type: Article

Abstract

The objective of the work presented in this paper is to estimate the quality of corpora retrieved from the Basque Web. The methodology is similar to that used for English and German by Ringlstetter et al. (2006). The main difference is that we reuse spelling checkers to detect errors. We believe this yields higher error coverage and that the method can be applied to other languages with practically no manual work, provided such tools are available for them. The results can be used to improve the quality of corpora obtained from the web by eliminating documents whose errors exceed a given threshold.
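The filtering step mentioned at the end of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the spelling checker is stood in for by a hypothetical `is_correct` callable (here backed by a toy word set), and the function names, tokenizer, and threshold value are assumptions chosen only for illustration.

```python
import re
from typing import Callable, Iterable, List

# Assumed tokenizer: runs of Unicode letters (no digits or underscores).
TOKEN_RE = re.compile(r"[^\W\d_]+", re.UNICODE)


def error_rate(text: str, is_correct: Callable[[str], bool]) -> float:
    """Fraction of tokens that the (hypothetical) spelling checker rejects."""
    tokens = TOKEN_RE.findall(text)
    if not tokens:
        return 0.0
    errors = sum(1 for token in tokens if not is_correct(token.lower()))
    return errors / len(tokens)


def filter_corpus(
    documents: Iterable[str],
    is_correct: Callable[[str], bool],
    threshold: float,
) -> List[str]:
    """Keep only documents whose error rate does not exceed the threshold."""
    return [doc for doc in documents if error_rate(doc, is_correct) <= threshold]


# Toy stand-in for a real Basque spelling checker (assumption, illustration only).
toy_lexicon = {"kaixo", "mundua", "etxea", "handia", "da"}
docs = ["Kaixo mundua", "etxea haundia da", "kaxo mundua"]
kept = filter_corpus(docs, toy_lexicon.__contains__, threshold=0.4)
```

In practice `is_correct` would wrap whatever spelling checker is available for the target language, and the threshold would be tuned on a sample of documents judged by hand.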

References

  • Alegria I., Aranzabe M., Ezeiza A., Ezeiza N., Urizar R. 2002. Using Finite State Technology in Natural Language Processing of Basque. LNCS: Implementation and Application of Automata. 2002. Springer.
  • Alegria I., Etxeberria I., Hulden M., Maritxalar M. 2009. Porting Basque Morphological Grammars to foma, an Open-Source Tool. FSMNLP2009. Pretoria. South Africa.
  • Beesley K. R. and Karttunen L. 2003. Finite State Morphology. CSLI Publications, Palo Alto, CA.
  • Hulden M. 2009. Foma: a Finite-State Compiler and Library. EACL 2009. Demo session. pp 29-32.
  • Kilgarriff, A. and Grefenstette, G. 2003. Introduction to the special issue on the web as corpus. Computational Linguistics, 29(3): 333-347. MIT Press.
  • Kukich K. 1992. Techniques for Automatically Correcting Words in Text. ACM Comput. Surv. 24(4): 377-439.
  • Leturia I., San Vicente I., Saralegi X. and Lopez de Lacalle M. 2008. Collecting Basque specialized corpora from the web: language-specific performance tweaks and improving topic precision. Proc. of the 4th. Web as Corpus Workshop. LREC 2008.
  • Ringlstetter, C., Schulz, K.U. and Mihov, S. 2006. Orthographic errors in web pages: Toward cleaner web corpora. Computational Linguistics, 32(3): 295-340. MIT Press.
  • Sharoff, S. 2006. Creating General-Purpose Corpora Using Automated Search Engine Queries. WaCky! Working Papers on the Web as Corpus, 63-98. Ed. Marco Baroni and Silvia Bernardini. Bologna.
  • Whitelaw C., Hutchinson B., Chung, G.Y. and Ellis G. 2009. Using the web for language independent spellchecking and autocorrection. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, 890-899.