Using annotated discourse information of an RST Spanish-Chinese treebank for translation and language learning tasks

Cao, Shuyuan

Supervised by:

Iria da Cunha Fanego Supervisor
Mikel Iruskieta Quintian Co-supervisor

University of defense: Universitat Pompeu Fabra

Date of defense: 9 November 2018

Jury:

M. Aranzazu Diaz de Ilarraza Sanchez President
Mireia Vargas Urpí Secretary
Juliano Desiderato Antonio Rapporteur

Type: Thesis

Teseo: 574265

Abstract

As one of the essential elements of Natural Language Processing (NLP), discourse has attracted considerable attention in recent years. Many studies explore how discourse elements affect different NLP research areas, such as parsing, sentiment analysis, and machine translation evaluation, among others. In addition, as discourse analysis has developed, treebanks annotated with discourse information for different languages have contributed greatly to advancing NLP research. Spanish and Chinese are two of the most widely spoken languages in the world, and the language pair occupies an important position in NLP studies. Therefore, this study carries out a discourse analysis of the two languages, annotating discourse similarities and differences under the theoretical framework of Rhetorical Structure Theory (RST) by Mann and Thompson (1988). The main objective of the study, based on the annotation results, is to develop a protocol that includes recommendations for Spanish-Chinese translation. In addition, in today's globalized society, communication between Spanish and Chinese speakers is increasingly intensive. Another aim of our study is therefore to develop resources for language learning between Spanish and Chinese. To develop the protocol, we first build a Spanish-Chinese parallel corpus and annotate the discourse information of the entire corpus. We then evaluate the annotation results following a qualitative method to ensure their quality. Lastly, we summarize the discourse similarities and differences to build the protocol. For language learning between the two languages, we use the manually annotated discourse markers (DMs) to develop a question-answering module. In recent years, there have been few contrastive works on Spanish and Chinese discourse analysis. Therefore, this PhD study aims to partially fill a knowledge gap in the study of Spanish and Chinese.