Detecting the central units in two different genres and languages: a preliminary study of Brazilian Portuguese and Basque texts

  1. Mikel Iruskieta
  2. Gorka Labaka
  3. Juliano D. Antonio
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2016

Issue: 56

Pages: 65-72

Type: Article

Abstract

This paper presents the development of a rule-based automatic detector that identifies the main idea, or most pertinent discourse unit, of a text in two languages (Basque and Brazilian Portuguese) and two genres (scientific abstracts and argumentative answers). The central unit (CU) is useful for understanding a text's relational discourse structure, and it can be applied to Natural Language Processing (NLP) tasks such as automatic summarization, question answering or sentiment analysis. For the argumentative answer genre, identifying the CU is an essential step toward an eventual automatic evaluator for this genre. The theoretical framework underlying the paper is Mann and Thompson’s (1988) Rhetorical Structure Theory (RST), with the texts segmented into discourse units and annotated for CUs. Results show that CUs are detected automatically with similar results across both languages and both genres, although there is room for improvement.

References

  • Antonio, J. D. 2015. Detecting central units in argumentative answer genre: signals that influence annotators' agreement. In 5th Workshop "RST and Discourse Studies", in Actas del XXXI Congreso de la Sociedad Española del Procesamiento del Lenguaje Natural (SEPLN 2015), Alicante (España).
  • Antonio, J.D., and F.T.R. Cassim. 2012. Coherence relations in academic spoken discourse. Linguistica 52, pp. 323–336.
  • Burstein, J.C., D. Marcu, S. Andreyev, and M.S. Chodorow. 2001. Towards automatic classification of discourse elements in essays. In Proceedings of the 39th annual Meeting on Association for Computational Linguistics, pp. 98–105. Association for Computational Linguistics.
  • Cardoso, P.C.F., E.G. Maziero, M.L.C. Jorge, E.M.R. Seno, A. Di Felippo, L.H.M. Rino, M.G.V. Nunes, and T.A.S. Pardo. 2011. CSTNews: A Discourse-Annotated Corpus for Single and Multi-Document Summarization of News Texts in Brazilian Portuguese. Proceedings of the 3rd RST Brazilian Meeting, pp. 88–105. Cuiabá/MT, Brasil.
  • Carlson, L., M.E. Okurowski, and D. Marcu. 2002. RST Discourse Treebank, LDC2002T07 [Corpus]. Philadelphia, PA: Linguistic Data Consortium.
  • Corston-Oliver, S. 1998. Identifying the linguistic correlates of rhetorical relations, Proceedings of the ACL Workshop on Discourse Relations and Discourse Markers 1998, pp. 8–14.
  • Da Cunha, I., E. SanJuan, J.M. Torres-Moreno, M. Lloberes, and I. Castellón. 2012. DiSeg 1.0: The first system for Spanish discourse segmentation. Expert Systems with Applications, 39(2), pp. 1671–1678.
  • Da Cunha, I., and M. Iruskieta. 2010. Comparing rhetorical structures in different languages: The influence of translation strategies. Discourse Studies, 12(5), pp. 563–598.
  • Da Cunha, I., J.M. Torres-Moreno, and G. Sierra. 2011. On the Development of the RST Spanish Treebank, 5th Linguistic Annotation Workshop (LAW V '11), 23 June 2011, Association for Computational Linguistics, pp. 1–10.
  • Egg, M., and G. Redeker. 2010. How complex is discourse structure? In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), pp. 1619–1623, Valletta, Malta, 19-21 May.
  • Hanneforth, T., S. Heintze, and M. Stede. 2003. Rhetorical parsing with underspecification and forests. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003, short papers, Volume 2, Association for Computational Linguistics, pp. 31–33.
  • Iruskieta, M., A. Díaz de Ilarraza, G. Labaka, and M. Lersundi. 2015. The Detection of Central Units in Basque scientific abstracts. In 5th Workshop "RST and Discourse Studies", in Actas del XXXI Congreso de la Sociedad Española del Procesamiento del Lenguaje Natural (SEPLN 2015), Alicante (España).
  • Iruskieta, M., A. Díaz de Ilarraza, and M. Lersundi. 2014. The annotation of the Central Unit in Rhetorical Structure Trees: A Key Step in Annotating Rhetorical Relations. Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 466–475, Dublin, Ireland. August 23-29.
  • Iruskieta, M., M.J. Aranzabe, A. Díaz de Ilarraza, I. Gonzalez, M. Lersundi, and O. Lopez de Lacalle. 2013. The RST Basque TreeBank: an online search interface to check rhetorical relations. 4th Workshop RST and Discourse Studies, Sociedade Brasileira de Computação, Fortaleza, CE, Brasil. October 2013.
  • Iruskieta, M., and B. Zapirain. 2015. EusEduSeg: a Dependency-Based EDU Segmentation for Basque. In Actas del XXXI Congreso de la Sociedad Española del Procesamiento del Lenguaje Natural (SEPLN 2015), Spain. September 2015.
  • Joty, S., G. Carenini, and R.T. Ng. 2015. CODRA: A Novel Discriminative Framework for Rhetorical Analysis. Computational Linguistics, 41(3), pp. 385–435.
  • Mann, W.C., and S.A. Thompson. 1988. Rhetorical Structure Theory: Toward a functional theory of text organization. Text - Interdisciplinary Journal for the Study of Discourse, 8(3), pp. 243–281.
  • Marcu, D. 2000. The theory and practice of discourse parsing and summarization. Cambridge, MA: The MIT Press.
  • Matthiessen, C. 2005. Remembering Bill Mann. Computational Linguistics, 31(2), pp. 161–172.
  • Matthiessen, C., and S. Thompson. 1988. The structure of discourse and ‘subordination’. In: Haiman, J. and Thompson, S. (Eds.) Clause Combining in Grammar and Discourse. Amsterdam/Philadelphia: J. Benjamins, pp. 275–329.
  • Menegassi, R.J. 2011. A Escrita na Formação Docente Inicial: Influências da Iniciação à Pesquisa. Signum: Estudos da Linguagem, 14(1), pp. 387–419.
  • O'Donnell, M. 2000. RSTTool 2.4: a markup tool for Rhetorical Structure Theory. First International Conference on Natural Language Generation. pp. 253–256.
  • Pardo, T.A.S., and M.G.V. Nunes. 2008. On the Development and Evaluation of a Brazilian Portuguese Discourse Parser. Journal of Theoretical and Applied Computing, 15(2), pp. 43–64.
  • Pardo, T.A.S., and E.R.M. Seno. 2005. Rhetalho: um corpus de referência anotado retoricamente [Rhetalho: a rhetorically annotated reference corpus], Anais do V Encontro de Corpora, 24-25 November 2005.
  • Pardo, T.A.S., L.H.M. Rino, and M.G.V. Nunes. 2003. GistSumm: A summarization tool based on a new extractive method. Computational Processing of the Portuguese Language, pp. 210–218.
  • Stede, M. 2008. RST revisited: disentangling nuclearity. In 'Subordination' versus 'coordination' in sentence and text, pp. 33–57. John Benjamins, Amsterdam and Philadelphia.
  • Scott, D.R., J. Delin, and A.F. Hartley. 1998. Identifying congruent pragmatic relations in procedural texts. Languages in Contrast, 1(1), pp. 45–82.
  • Stede, M. 2004. The Potsdam commentary corpus, 2004 ACL Workshop on Discourse Annotation, 25-26 July 2004, Association for Computational Linguistics, pp. 96–102.
  • Swales, J.M. 1990. Genre analysis: English in academic and research settings. Cambridge, UK: Cambridge University Press.
  • Sumita, K., K. Ono, T. Chino, and T. Ukita. 1992. A discourse structure analyzer for Japanese text. ICOT, pp. 1133–1140.
  • Taboada, M., and W.C. Mann. 2006. Applications of Rhetorical Structure Theory. Discourse Studies, 8(4), pp. 567–588.
  • Taboada, M., and J. Renkema. 2011. Discourse Relations Reference Corpus [Corpus]. Simon Fraser University and Tilburg University. Available from http://www.sfu.ca/rst/06tools/discourse_relations_corpus.html.
  • Tofiloski, M., J. Brooke, and M. Taboada. 2009. A syntactic and lexical-based discourse segmenter, 47th Annual Meeting of the Association for Computational Linguistics, 2-7 August 2009, ACL, pp. 77–80.
  • Van Dijk, T. 1980. Macrostructures: an Interdisciplinary Study of Global Structures in Discourse, Interaction and Cognition. Lawrence Erlbaum, Hillsdale.