Building the Gold Standard for the Surface Syntax of Basque
- Díaz de Ilarraza Sánchez, Arantza
- Urizar Enbeita, Rubén
- González Dios, Itziar
- Aduriz Agirre, Itziar
- Aranzabe Urruzola, María Jesús
- Arriola Egurrola, José María
ISSN: 1135-5948
Year of publication: 2017
Issue: 58
Pages: 125-132
Type: Article
More publications in: Procesamiento del lenguaje natural
Abstract
In this article we present the process of building SF-EPEC, a syntactically annotated corpus of 300,000 words intended to serve as a Gold Standard for the surface syntactic processing of Basque. First, we describe the tagset designed for this purpose; because Basque is an agglutinative language, we sometimes had to create compound syntactic tags. We also detail the different phases in the construction of SF-EPEC.
Bibliographic References
- Aduriz, I. 2000. EUSMG: Morfologiatik sintaxira Murriztapen Gramatika erabiliz. Ph.D. thesis, University of the Basque Country (UPV/EHU).
- Aduriz, I., I. Aldezabal, I. Alegria, J. M. Arriola, A. Díaz de Ilarraza, N. Ezeiza, and K. Gojenola. 2003. Finite State Applications for Basque. In EACL’2003 Workshop on Finite-State Methods in Natural Language Processing, pages 3–11.
- Aduriz, I., M. J. Aranzabe, J. M. Arriola, A. Atutxa, A. Díaz de Ilarraza, N. Ezeiza, K. Gojenola, M. Oronoz, A. Soroa, and R. Urizar. 2006a. Methodology and Steps Towards the Construction of EPEC, a Corpus of Written Basque Tagged at Morphological and Syntactic levels for Automatic Processing. Language and Computers, 56(1):1–15.
- Aduriz, I., M. J. Aranzabe, J. M. Arriola, and A. Díaz de Ilarraza. 2006b. Sintaxi Partziala. In B. Fernández and I. Laka, editors, Andolin gogoan: Essays in Honour of Professor Eguzkitza, pages 31–49.
- Aduriz, I., J. M. Arriola, I. Gonzalez-Dios, and R. Urizar. 2015. Funtzio Sintaktikoen Gold Estandarra eskuz etiketatzeko gidalerroak. Technical report, University of the Basque Country (UPV/EHU) UPV/EHU/LSI/TR 01-2015.
- Aduriz, I. and A. Díaz de Ilarraza. 2013. Morphosyntactic Disambiguation and Shallow Parsing in Computational Processing of Basque. Anuario del Seminario de Filología Vasca “Julio de Urquijo”, pages 1–21.
- Aldezabal, I., O. Ansa, B. Arrieta, X. Artola, A. Ezeiza, G. Hernández, and M. Lersundi. 2001. EDBL: a General Lexical Basis for the Automatic Processing of Basque. In Proceedings of the IRCS Workshop on linguistic databases, pages 1–10.
- Aldezabal, I., K. Ceberio, I. Esparza, A. Estarrona, J. Etxeberria, M. Iruskieta, E. Izagirre, and L. Uria. 2007. EPEC (Euskararen Prozesamendurako Erreferentzia Corpusa) segmentazio-mailan etiketatzeko eskuliburua. Technical report, University of the Basque Country (UPV/EHU) UPV/EHU/LSI/TR 11-2007.
- Alegria, I., X. Artola, K. Sarasola, and M. Urkia. 1996. Automatic Morphological Analysis of Basque. Literary and Linguistic Computing, 11(4):193–203.
- Aranzabe, M. J. and A. Díaz de Ilarraza. 2009. Análisis sintáctico computacional del euskera mediante una gramática de dependencias. In Actas del XI Simposio Internacional de Comunicación Social, pages 316–320. Centro de Lingüística Aplicada.
- Arriola, J. M. 2015. Different Issues in the Design and Implementation of a Rule Based Grammar for the Surface Syntactic Disambiguation of Basque. In Proceedings of the Workshop on “Constraint Grammar - methods, tools and applications” at NODALIDA 2015, number 113, pages 1–9. Linköping University Electronic Press.
- Bick, E. 2000. The Parsing System Palavras. Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Ph.D. thesis, Aarhus University Press.
- Fleiss, J. L. 1971. Measuring Nominal Scale Agreement among many Raters. Psychological bulletin, 76(5):378–382.
- Karlsson, F., A. Voutilainen, J. Heikkila, and A. Anttila. 1995. Constraint Grammar, A Language-independent System for Parsing Unrestricted Text. Mouton de Gruyter.
- Marcus, M. P., M. A. Marcinkiewicz, and B. Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational linguistics, 19(2):313–330.
- Mille, S., A. Burga, V. Vidal, and L. Wanner. 2009. Towards a Rich Dependency Annotation of Spanish Corpora. Procesamiento del Lenguaje Natural, (43):325–333.
- Nilsson, J. and J. Hall. 2005. Reconstruction of the Swedish Treebank Talbanken. Technical report, Växjö University, Sweden. School of Mathematics and Systems Engineering. MSI report 05067.
- Sampson, G. 2011. A Two-way Exchange between Syntax and Corpora. In V. Vander, S. Zyngier, and G. Barnbrook, editors, Perspectives on Corpus Linguistics, volume XVI, 256. pages 197–211.
- Scheible, S., R. J. Whitt, M. Durrell, and P. Bennett. 2011. A Gold Standard Corpus of Early Modern German. In Proceedings of the ACL-HLT 25th Linguistic Annotation workshop, pages 124–128. Association for Computational Linguistics.
- Silveira, N., T. Dozat, M.-C. de Marneffe, S. R. Bowman, M. Connor, J. Bauer, and C. D. Manning. 2014. A Gold Standard Dependency Corpus for English. In Proceedings of LREC 2014, the Ninth International Conference on Language Resources and Evaluation, pages 2897–2904.
- Solberg, P. E., A. Skjærholt, L. Øvrelid, K. Hagen, and J. B. Johannessen. 2014. The Norwegian Dependency Treebank. In Proceedings of LREC’14, the Ninth International Conference on Language Resources and Evaluation, pages 789–795.
- Voutilainen, A., T. Purtonen, and K. Muhonen. 2012. Outsourcing Parsebanking: The FinnTreeBank Project. In Shall We Play the Festschrift Game? Essays on the Occasion of Lauri Carlson’s 60th Birthday. Springer, pages 117–131.