Detecting the central units in two different genres and languages: a preliminary study of Brazilian Portuguese and Basque texts

  1. Mikel Iruskieta
  2. Gorka Labaka
  3. Juliano D. Antonio
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2016

Issue: 56

Pages: 65-72

Type: Article

Abstract

This paper presents improvements to a rule-based automatic detector that identifies the main idea, or most salient discourse unit, in two languages as different as Basque and Brazilian Portuguese and in two very different genres: scientific article abstracts and argumentative answers. The central unit (CU) can be useful for understanding texts on the basis of their relational discourse structure, and for applying that understanding to Natural Language Processing (NLP) tasks such as automatic summarization, question answering, or sentiment analysis. In argumentative answer texts, identifying the CU is an essential step for an automatic evaluator that takes the discourse structure of those texts into account. The theoretical framework of this work is Rhetorical Structure Theory (RST) by Mann and Thompson (1988), proceeding from discourse segmentation to the annotation of the central unit. The results show that central units in different languages and genres are detected automatically with similar results, although there is still room for improvement.

References

  • Antonio, J.D. 2015. Detecting central units in argumentative answer genre: signals that influence annotators' agreement. In 5th Workshop "RST and Discourse Studies", in Actas del XXXI Congreso de la Sociedad Española del Procesamiento del Lenguaje Natural (SEPLN 2015), Alicante (Spain).
  • Antonio, J.D., and F.T.R. Cassim. 2012. Coherence relations in academic spoken discourse. Linguistica 52, pp. 323–336.
  • Burstein, J.C., D. Marcu, S. Andreyev, and M.S. Chodorow. 2001. Towards automatic classification of discourse elements in essays. In Proceedings of the 39th annual Meeting on Association for Computational Linguistics, pp. 98–105. Association for Computational Linguistics.
  • Cardoso, P.C.F., E.G. Maziero, M.L.C. Jorge, E.M.R. Seno, A. Di Felippo, L.H.M. Rino, M.G.V. Nunes, and T.A.S. Pardo. 2011. CSTNews - A Discourse-Annotated Corpus for Single and Multi-Document Summarization of News Texts in Brazilian Portuguese. Proceedings of the 3rd RST Brazilian Meeting, pp. 88–105. Cuiabá/MT, Brasil.
  • Carlson, L., M.E. Okurowski, and D. Marcu. 2002. RST Discourse Treebank, LDC2002T07 [Corpus]. Philadelphia, PA: Linguistic Data Consortium.
  • Corston-Oliver, S. 1998. Identifying the linguistic correlates of rhetorical relations, Proceedings of the ACL Workshop on Discourse Relations and Discourse Markers 1998, pp. 8–14.
  • Da Cunha, I., E. San Juan, J.M. Torres-Moreno, M. Lloberes, and I. Castellón. 2012. DiSeg 1.0: The first system for Spanish discourse segmentation. Expert Systems with Applications, 39(2), pp. 1671–1678.
  • Da Cunha, I., and M. Iruskieta. 2010. Comparing rhetorical structures in different languages: The influence of translation strategies. Discourse Studies, 12(5), pp. 563–598.
  • Da Cunha, I., J.M. Torres-Moreno, and G. Sierra. 2011. On the Development of the RST Spanish Treebank, 5th Linguistic Annotation Workshop (LAW V '11), 23 June 2011, Association for Computational Linguistics, pp. 1–10.
  • Egg, M., and G. Redeker. 2010. How complex is discourse structure? In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), pp. 1619–1623, Valletta, Malta, 19-21 May.
  • Hanneforth, T., S. Heintze, and M. Stede. 2003. Rhetorical parsing with underspecification and forests. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Companion Volume of the Proceedings of HLT-NAACL 2003, Short Papers, Volume 2, Association for Computational Linguistics, pp. 31–33.
  • Iruskieta, M., A. Diaz de Ilarraza, G. Labaka, and M. Lersundi. 2015. The Detection of Central Units in Basque scientific abstracts. In 5th Workshop "RST and Discourse Studies", in Actas del XXXI Congreso de la Sociedad Española del Procesamiento del Lenguaje Natural (SEPLN 2015), Alicante (Spain).
  • Iruskieta, M., A. Díaz de Ilarraza, and M. Lersundi. 2014. The annotation of the Central Unit in Rhetorical Structure Trees: A Key Step in Annotating Rhetorical Relations. Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 466–475, Dublin, Ireland. August 23-29.
  • Iruskieta, M., M.J. Aranzabe, A. Diaz de Ilarraza, I. Gonzalez, I. Lersundi, and O. Lopez de Lacalle. 2013. The RST Basque TreeBank: an online search interface to check rhetorical relations, 4th Workshop RST and Discourse Studies, Sociedade Brasileira de Computação, Fortaleza, CE, Brasil. October 2013.
  • Iruskieta, M., and B. Zapirain. 2015. EusEduSeg: a Dependency-Based EDU Segmentation for Basque. In Actas del XXXI Congreso de la Sociedad Española del Procesamiento del Lenguaje Natural (SEPLN 2015), Spain. September 2015.
  • Joty, S., G. Carenini, and R.T. Ng. 2015. CODRA: A Novel Discriminative Framework for Rhetorical Analysis. Computational Linguistics, 41(3), pp. 385–435.
  • Mann, W.C., and S.A. Thompson. 1988. Rhetorical Structure Theory: Toward a functional theory of text organization. Text - Interdisciplinary Journal for the Study of Discourse, 8(3), pp. 243–281.
  • Marcu, D. 2000. The theory and practice of discourse parsing and summarization. Cambridge: The MIT press.
  • Matthiessen, C. 2005. Remembering Bill Mann. Computational Linguistics, 31(2), pp. 161–172.
  • Matthiessen, C., and S. Thompson. 1988. The structure of discourse and ‘subordination’. In: Haiman, J. and Thompson, S. (Eds.) Clause Combining in Grammar and Discourse. Amsterdam/Philadelphia: J. Benjamins, pp. 275–329.
  • Menegassi, R.J. 2011. A Escrita na Formação Docente Inicial: Influências da Iniciação à Pesquisa. Signum: Estudos da Linguagem, 14(1), pp. 387–419.
  • O'Donnell, M. 2000. RSTTool 2.4: a markup tool for Rhetorical Structure Theory. First International Conference on Natural Language Generation. pp. 253–256.
  • Pardo, T.A.S., and M.G.V. Nunes. 2008. On the Development and Evaluation of a Brazilian Portuguese Discourse Parser. Journal of Theoretical and Applied Computing, 15(2), pp. 43–64.
  • Pardo, T.A.S., and E.R.M. Seno. 2005. Rhetalho: um corpus de referência anotado retoricamente [Rhetalho: a rhetorically annotated reference corpus], Anais do V Encontro de Corpora, 24-25 November 2005.
  • Pardo, T.A.S., L.H.M. Rino, and M.G.V. Nunes. 2003. GistSumm: A summarization tool based on a new extractive method. Computational Processing of the Portuguese Language, pp. 210–218.
  • Scott, D.R., J. Delin, and A.F. Hartley. 1998. Identifying congruent pragmatic relations in procedural texts. Languages in Contrast, 1(1), pp. 45–82.
  • Stede, M. 2008. RST revisited: disentangling nuclearity. In 'Subordination' versus 'coordination' in sentence and text, pp. 33–57. John Benjamins, Amsterdam and Philadelphia.
  • Stede, M. 2004. The Potsdam commentary corpus, 2004 ACL Workshop on Discourse Annotation, 25-26 July 2004, Association for Computational Linguistics, pp. 96–102.
  • Swales, J.M. 1990. Genre analysis: English in academic and research settings. Cambridge, UK: Cambridge University Press.
  • Sumita, K., K. Ono, T. Chino, and T. Ukita. 1992. A discourse structure analyzer for Japanese text, 1992, ICOT, pp. 1133–1140.
  • Taboada, M., and W.C. Mann. 2006. Applications of Rhetorical Structure Theory. Discourse Studies, 8(4), pp. 567–588.
  • Taboada, M., and J. Renkema. 2011. Discourse Relations Reference Corpus [Corpus]. Simon Fraser University and Tilburg University. Available from http://www.sfu.ca/rst/06tools/discourse_relations_corpus.html.
  • Tofiloski, M., J. Brooke, and M. Taboada. 2009. A syntactic and lexical-based discourse segmenter, 47th Annual Meeting of the Association for Computational Linguistics, 2-7 August 2009, ACL, pp. 77–80.
  • Van Dijk, T. 1980. Macrostructures: an Interdisciplinary Study of Global Structures in Discourse, Interaction and Cognition. Lawrence Erlbaum, Hillsdale.