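Larramendiren Hiztegi Hirukoitzaren digitalizazioa: Karaktereen ezagutze optikoa eta Wikitekara igotzea
(Digitization of Larramendi's Trilingual Dictionary: optical character recognition and upload to Wikisource)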

  1. Mikel Alonso 1
  2. David Lindemann
  1. 1 Universidad del País Vasco/Euskal Herriko Unibertsitatea, Lejona, Spain

    ROR https://ror.org/000xsnr85

Journal:
Uztaro: giza eta gizarte-zientzien aldizkaria

ISSN: 1130-5738

Year of publication: 2022

Issue: 120

Pages: 83-93

Type: Article

DOI: 10.26876/UZTARO.120.2022.5

Abstract

In this article, we describe the OCR process used to digitize Larramendi’s Diccionario Trilingüe, which relies on machine learning, a branch of artificial intelligence. We describe the preprocessing of the scanned images and the use of a tool named Kraken to train a recognition model for the dictionary text from a hand-transcribed sample. In addition to a highly accurate text, the process produced files recording italics and the position of the text, which make it possible to represent the structure of the dictionary. Because the results are available on the Wikisource platform, the transcription can be corrected through crowdsourcing. The corrected transcription can then serve as input to a machine-learning information extraction step that recovers the lexicographic structure of the dictionary. With this information, a first proposal for an RDF model will be developed in order to integrate the data into Wikidata.