Building the Gold Standard for the Surface Syntax of Basque

  1. Díaz de Ilarraza Sánchez, Arantza
  2. Urizar Enbeita, Rubén
  3. González Dios, Itziar
  4. Aduriz Agirre, Itziar
  5. Aranzabe Urruzola, María Jesús
  6. Arriola Egurrola, José María
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2017

Issue: 58

Pages: 125-132

Type: Article

Abstract

In this paper, we present the construction process of SF-EPEC, a syntactically annotated 300,000-word corpus that aims to be a Gold Standard for the surface syntactic processing of Basque. First, we describe the tagset designed for this purpose; since Basque is an agglutinative language, complex syntactic tags were sometimes needed. We also describe the different phases in the construction of SF-EPEC.

Bibliographic References

  • Aduriz, I. 2000. EUSMG: Morfologiatik sintaxira Murriztapen Gramatika erabiliz. Ph.D. thesis, University of the Basque Country (UPV/EHU).
  • Aduriz, I., I. Aldezabal, I. Alegria, J. M. Arriola, A. Díaz de Ilarraza, N. Ezeiza, and K. Gojenola. 2003. Finite State Applications for Basque. In EACL’2003 Workshop on Finite-State Methods in Natural Language Processing, pages 3–11.
  • Aduriz, I., M. J. Aranzabe, J. M. Arriola, A. Atutxa, A. Díaz de Ilarraza, N. Ezeiza, K. Gojenola, M. Oronoz, A. Soroa, and R. Urizar. 2006a. Methodology and Steps Towards the Construction of EPEC, a Corpus of Written Basque Tagged at Morphological and Syntactic levels for Automatic Processing. Language and Computers, 56(1):1–15.
  • Aduriz, I., M. J. Aranzabe, J. M. Arriola, and A. Díaz de Ilarraza. 2006b. Sintaxi Partziala. In B. Fernández and I. Laka, editors, Andolin gogoan: Essays in Honour of Professor Eguzkitza, pages 31–49.
  • Aduriz, I., J. M. Arriola, I. Gonzalez-Dios, and R. Urizar. 2015. Funtzio Sintaktikoen Gold Estandarra eskuz etiketatzeko gidalerroak. Technical report, University of the Basque Country (UPV/EHU) UPV/EHU/LSI/TR 01-2015.
  • Aduriz, I. and A. Díaz de Ilarraza. 2013. Morphosyntactic Disambiguation and Shallow Parsing in Computational Processing of Basque. Anuario del Seminario de Filología Vasca “Julio de Urquijo”, pages 1–21.
  • Aldezabal, I., O. Ansa, B. Arrieta, X. Artola, A. Ezeiza, G. Hernández, and M. Lersundi. 2001. EDBL: a General Lexical Basis for the Automatic Processing of Basque. In Proceedings of the IRCS Workshop on linguistic databases, pages 1–10.
  • Aldezabal, I., K. Ceberio, I. Esparza, A. Estarrona, J. Etxeberria, M. Iruskieta, E. Izagirre, and L. Uria. 2007. EPEC (Euskararen Prozesamendurako Erreferentzia Corpusa) segmentazio-mailan etiketatzeko eskuliburua. Technical report, University of the Basque Country (UPV/EHU) UPV/EHU/LSI/TR 11-2007.
  • Alegria, I., X. Artola, K. Sarasola, and M. Urkia. 1996. Automatic Morphological Analysis of Basque. Literary and Linguistic Computing, 11(4):193–203.
  • Aranzabe, M. J. and A. Díaz de Ilarraza. 2009. Análisis sintáctico computacional del euskera mediante una gramática de dependencias. In Actas del XI Simposio Internacional de Comunicación Social, pages 316–320. Centro de Lingüística Aplicada.
  • Arriola, J. M. 2015. Different Issues in the Design and Implementation of a Rule Based Grammar for the Surface Syntactic Disambiguation of Basque. In Proceedings of the Workshop on “Constraint Grammar - methods, tools and applications” at NODALIDA 2015, number 113, pages 1–9. Linköping University Electronic Press.
  • Bick, E. 2000. The Parsing System Palavras. Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Ph.D. thesis, Aarhus University Press.
  • Fleiss, J. L. 1971. Measuring Nominal Scale Agreement among many Raters. Psychological Bulletin, 76(5):378–382.
  • Karlsson, F., A. Voutilainen, J. Heikkilä, and A. Anttila. 1995. Constraint Grammar, A Language-independent System for Parsing Unrestricted Text. Mouton de Gruyter.
  • Marcus, M. P., M. A. Marcinkiewicz, and B. Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
  • Mille, S., A. Burga, V. Vidal, and L. Wanner. 2009. Towards a Rich Dependency Annotation of Spanish Corpora. Procesamiento del Lenguaje Natural, (43):325–333.
  • Nilsson, J. and J. Hall. 2005. Reconstruction of the Swedish Treebank Talbanken. Technical report, Växjö University, Sweden. School of Mathematics and Systems Engineering. MSI report 05067.
  • Sampson, G. 2011. A Two-way Exchange between Syntax and Corpora. In V. Vander, S. Zyngier, and G. Barnbrook, editors, Perspectives on Corpus Linguistics, volume XVI, 256. pages 197–211.
  • Scheible, S., R. J. Whitt, M. Durrell, and P. Bennett. 2011. A Gold Standard Corpus of Early Modern German. In Proceedings of the ACL-HLT 25th Linguistic Annotation workshop, pages 124–128. Association for Computational Linguistics.
  • Silveira, N., T. Dozat, M.-C. de Marneffe, S. R. Bowman, M. Connor, J. Bauer, and C. D. Manning. 2014. A Gold Standard Dependency Corpus for English. In Proceedings of LREC 2014, the Ninth International Conference on Language Resources and Evaluation, pages 2897–2904.
  • Solberg, P. E., A. Skjærholt, L. Øvrelid, K. Hagen, and J. B. Johannessen. 2014. The Norwegian Dependency Treebank. In Proceedings of LREC’14, the Ninth International Conference on Language Resources and Evaluation, pages 789–795.
  • Voutilainen, A., T. Purtonen, and K. Muhonen. 2012. Outsourcing Parsebanking: The FinnTreeBank Project. In Shall We Play the Festschrift Game? Essays on the Occasion of Lauri Carlson’s 60th Birthday. Springer, pages 117–131.