Development of a Machine Translation system for promoting the use of a low resource language in the clinical domain: the case of Basque

  1. Soto, X. (1)
  2. Perez-de-Viñaspre, O. (1)
  3. Oronoz, M. (1)
  4. Labaka, G. (1)

  (1) HiTZ Basque Center for Language Technologies-Ixa, University of the Basque Country UPV/EHU, Donostia, Spain

Book:
Natural Language Processing In Healthcare: A Special Focus on Low Resource Languages

Publisher: CRC Press, Taylor & Francis Group

ISBN: 9780367685393, 9780367685409

Year of publication: 2022

Pages: 139-158

Type: Book chapter

Abstract

In multilingual environments where a strong language is spoken by the majority and a low resource language by a minority, Machine Translation (MT) can allow clinical practitioners to write their reports in the minority language and have them automatically translated into the majority language. Current state-of-the-art approaches to MT require large quantities of parallel sentences in the desired languages and the specific domain, so MT systems developed for translating clinical text from or into a low resource language will usually need to go through a domain adaptation process. When there are enough in-domain resources for the target language, back-translation is commonly used for domain adaptation. In our case, since we have access to many Electronic Health Records (EHR) in Spanish, we make use of this and similar techniques that leverage monolingual data for translating clinical text from Basque into Spanish. Moreover, one of the main characteristics of clinical domain text is its rich terminology, which is not available in every language, so before developing an MT system for the clinical domain it is beneficial to make a special effort to translate the clinical terminology into the low resource language. If the final objective is to implement a system that is useful for clinical practitioners, it is important to work with them to define the terminology and any other aspect that can affect the final performance of the systems. Given the special relevance of the content to be translated, users of the MT systems should be aware that errors are possible, and whenever possible a human translator should review the generated translations to guarantee their accuracy. In this chapter, we describe the approach we have followed to develop an MT system for translating clinical text from Basque into Spanish.
In the first section, we introduce the Basque language and give some details about the sociolinguistic situation in the Basque Country. In the second section, we present Itzulbide, the project carried out together with the Basque public health service to compile clinical domain corpora for developing an MT tool for the healthcare domain. In the third section, we give an overview of the diverse corpora we have used in our systems and specify how training and evaluation are performed. The fourth section discusses the results obtained in the defined settings. Finally, the fifth section presents some conclusions and points to possible future directions.