Building the Gold Standard for the Surface Syntax of Basque
- Díaz de Ilarraza Sánchez, Arantza
- Urizar Enbeita, Rubén
- González Dios, Itziar
- Aduriz Agirre, Itziar
- Aranzabe Urruzola, María Jesús
- Arriola Egurrola, José María
ISSN: 1135-5948
Year of publication: 2017
Issue: 58
Pages: 125-132
Type: Article
More publications in: Procesamiento del lenguaje natural
Abstract
In this paper, we present the construction process of SF-EPEC, a syntactically annotated corpus of 300,000 words that aims to be a Gold Standard for the surface syntactic processing of Basque. First, we describe the tagset designed for this purpose; because Basque is an agglutinative language, complex syntactic tags were sometimes needed. We also describe the different phases in the construction of SF-EPEC.
Bibliographic References
- Aduriz, I. 2000. EUSMG: Morfologiatik sintaxira Murriztapen Gramatika erabiliz. Ph.D. thesis, University of the Basque Country (UPV/EHU).
- Aduriz, I., I. Aldezabal, I. Alegria, J. M. Arriola, A. Díaz de Ilarraza, N. Ezeiza, and K. Gojenola. 2003. Finite State Applications for Basque. In EACL’2003 Workshop on Finite-State Methods in Natural Language Processing, pages 3–11.
- Aduriz, I., M. J. Aranzabe, J. M. Arriola, A. Atutxa, A. Díaz de Ilarraza, N. Ezeiza, K. Gojenola, M. Oronoz, A. Soroa, and R. Urizar. 2006a. Methodology and Steps Towards the Construction of EPEC, a Corpus of Written Basque Tagged at Morphological and Syntactic levels for Automatic Processing. Language and Computers, 56(1):1–15.
- Aduriz, I., M. J. Aranzabe, J. M. Arriola, and A. Díaz de Ilarraza. 2006b. Sintaxi Partziala. In B. Fernández and I. Laka, editors, Andolin gogoan: Essays in Honour of Professor Eguzkitza. pages 31–49.
- Aduriz, I., J. M. Arriola, I. Gonzalez-Dios, and R. Urizar. 2015. Funtzio Sintaktikoen Gold Estandarra eskuz etiketatzeko gidalerroak. Technical report, University of the Basque Country (UPV/EHU) UPV/EHU/LSI/TR 01-2015.
- Aduriz, I. and A. Díaz de Ilarraza. 2013. Morphosyntactic Disambiguation and Shallow Parsing in Computational Processing of Basque. Anuario del Seminario de Filología Vasca “Julio de Urquijo”, pages 1–21.
- Aldezabal, I., O. Ansa, B. Arrieta, X. Artola, A. Ezeiza, G. Hernández, and M. Lersundi. 2001. EDBL: a General Lexical Basis for the Automatic Processing of Basque. In Proceedings of the IRCS Workshop on linguistic databases, pages 1–10.
- Aldezabal, I., K. Ceberio, I. Esparza, A. Estarrona, J. Etxeberria, M. Iruskieta, E. Izagirre, and L. Uria. 2007. EPEC (Euskararen Prozesamendurako Erreferentzia Corpusa) segmentazio-mailan etiketatzeko eskuliburua. Technical report, University of the Basque Country (UPV/EHU) UPV/EHU/LSI/TR 11-2007.
- Alegria, I., X. Artola, K. Sarasola, and M. Urkia. 1996. Automatic Morphological Analysis of Basque. Literary and Linguistic Computing, 11(4):193–203.
- Aranzabe, M. J. and A. Díaz de Ilarraza. 2009. Análisis sintáctico computacional del euskera mediante una gramática de dependencias. In Actas del XI Simposio Internacional de Comunicación Social, pages 316–320. Centro de Lingüística Aplicada.
- Arriola, J. M. 2015. Different Issues in the Design and Implementation of a Rule Based Grammar for the Surface Syntactic Disambiguation of Basque. In Proceedings of the Workshop on “Constraint Grammar - methods, tools and applications” at NODALIDA 2015, number 113, pages 1–9. Linköping University Electronic Press.
- Bick, E. 2000. The Parsing System Palavras. Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Ph.D. thesis, Aarhus University Press.
- Fleiss, J. L. 1971. Measuring Nominal Scale Agreement among many Raters. Psychological bulletin, 76(5):378–382.
- Karlsson, F., A. Voutilainen, J. Heikkila, and A. Anttila. 1995. Constraint Grammar, A Language-independent System for Parsing Unrestricted Text. Mouton de Gruyter.
- Marcus, M. P., M. A. Marcinkiewicz, and B. Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational linguistics, 19(2):313–330.
- Mille, S., A. Burga, V. Vidal, and L. Wanner. 2009. Towards a Rich Dependency Annotation of Spanish Corpora. Procesamiento del Lenguaje Natural, (43):325–333.
- Nilsson, J. and J. Hall. 2005. Reconstruction of the Swedish Treebank Talbanken. Technical report, Växjö University, Sweden. School of Mathematics and Systems Engineering. MSI report 05067.
- Sampson, G. 2011. A Two-way Exchange between Syntax and Corpora. In V. Vander, S. Zyngier, and G. Barnbrook, editors, Perspectives on Corpus Linguistics, volume XVI, 256. pages 197–211.
- Scheible, S., R. J. Whitt, M. Durrell, and P. Bennett. 2011. A Gold Standard Corpus of Early Modern German. In Proceedings of the ACL-HLT 25th Linguistic Annotation workshop, pages 124–128. Association for Computational Linguistics.
- Silveira, N., T. Dozat, M.-C. de Marneffe, S. R. Bowman, M. Connor, J. Bauer, and C. D. Manning. 2014. A Gold Standard Dependency Corpus for English. In Proceedings of LREC 2014, the Ninth International Conference on Language Resources and Evaluation, pages 2897–2904.
- Solberg, P. E., A. Skjærholt, L. Øvrelid, K. Hagen, and J. B. Johannessen. 2014. The Norwegian Dependency Treebank. In Proceedings of LREC’14, the Ninth International Conference on Language Resources and Evaluation, pages 789–795.
- Voutilainen, A., T. Purtonen, and K. Muhonen. 2012. Outsourcing Parsebanking: The FinnTreeBank Project. In Shall We Play the Festschrift Game? Essays on the Occasion of Lauri Carlson’s 60th Birthday. Springer, pages 117–131.