Hizketa-ezagutzan oinarritutako estrategiak, euskarazko online OBHI (Ordenagailu Bidezko Hizkuntza Ikaskuntza) sistemetarako

ODRIOZOLA SUSTAETA, IGOR

Supervised by:

Eva Navas Cordón (supervisor)
Inmaculada Hernáez Rioja (supervisor)

Defence university: Universidad del País Vasco - Euskal Herriko Unibertsitatea

Defence date: 3 May 2019

Examination committee:

Alfonso Ortega Giménez (chair)
Ibon Saratxaga Couceiro (secretary)
Alberto Abad Gareta (member)

Department:

Communications Engineering

Type: Thesis

Teseo: 149790

Abstract

Interest in using speech technology in computer-assisted language learning (CALL) systems is growing, mainly because speech technologies have improved considerably in recent years and more and more people use them, ever more naturally. The literature shows two major areas of interest in the use of Automatic Speech Recognition (ASR) in CALL: Computer-assisted Pronunciation Training (CAPT) and Spoken Grammar Practice (SGP). In a typical CAPT application, the user reads and records a sentence and sends it to the learning application server, which returns a result, generally using a three-level colour system to indicate which phones have been pronounced correctly and which have not. SGP applications are not yet very popular; examples include multiple-choice tests, where users choose orally among several answers, and Intelligent Language Tutoring Systems, where learners respond to cues given by the computer and the system provides feedback. Such tools can strengthen students' autonomy in their learning process by letting them use their voice to improve their pronunciation or to do grammar exercises outside the classroom. In this work, two applications have been considered: on the one hand, a classical CAPT system, where the student records a predefined sentence and the system gives feedback on the pronunciation; on the other hand, a novel Word-by-Word Sentence Verification (WWSV) system, where a sentence is verified sequentially, word by word, and each word is displayed to the user as soon as it is detected. The WWSV system makes it possible to build a tool for solving grammar exercises orally (SGP). Both systems rely on utterance verification techniques, such as the popular Goodness of Pronunciation (GOP) score.
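The GOP-based, three-level feedback mentioned above can be sketched as follows. This is only an illustration under stated assumptions: it uses a common formulation of GOP (the absolute difference between the forced-alignment log-likelihood of the expected phone and the log-likelihood of the best competing phone, divided by the number of frames), which may differ from the one used in the thesis. The function names, the threshold values and the colour names are hypothetical.

```python
def gop_score(forced_loglik, free_loglik, n_frames):
    """Duration-normalised GOP (assumed formulation).

    forced_loglik: log-likelihood of the segment given the expected phone.
    free_loglik:   log-likelihood of the segment given the best competing phone.
    n_frames:      number of acoustic frames in the segment.
    Values close to 0 mean the expected phone explains the audio well.
    """
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    return abs(forced_loglik - free_loglik) / n_frames


def colour_for(gop, ok_threshold=1.0, bad_threshold=3.0):
    """Map a GOP score to one of three feedback levels.

    Both thresholds and the colour names are hypothetical placeholders.
    """
    if gop <= ok_threshold:
        return "green"
    if gop <= bad_threshold:
        return "yellow"
    return "red"


# Example: a 10-frame phone whose forced alignment scores 5 nats below the best phone.
print(colour_for(gop_score(-210.0, -205.0, 10)))
```

In practice the thresholds would be set per phone from data rather than fixed by hand.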
The acoustic database chosen to train the systems is the Basque Speecon-like database, the only publicly available database for Basque that is specifically designed for speech recognition and recorded through a microphone. This database has several drawbacks, such as the lack of a pronunciation lexicon and of some annotation files. Furthermore, it contains a great deal of dialectal speech, mainly in the spontaneous part. In ASR, phonetic variations can be modelled using a single acoustic model. CAPT systems, however, require "clean" models to use as a reference. Some work therefore had to be carried out on the annotation files of the database: a number of transcriptions have been changed and a new lexicon has been created that takes different dialectal alternatives into account. Notably, the new lexicon contains on average 4.12 different pronunciations per word. The speech recognition software used in this thesis is AhoSR, created and developed by the Aholab research group and designed to cope with different recognition tasks. In this thesis, utterance verification techniques have been implemented to run together with the basic tasks. To this end, a parallel graph has been implemented to obtain GOP scores. For the CAPT and WWSV tasks, specific search graphs have been added to meet the needs of each. In addition, sockets have been implemented in the audio-input module of AhoSR. This allows real-time operation when the recogniser is accessed over the internet, so AhoSR can be installed on a server with universal access. Different ways of training Hidden Markov Models (HMMs) have been analysed in depth. Initially, the new dictionary with alternatives was expected to yield better HMMs. However, the results do not show this, probably because of the large number of alternative pronunciations.
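The socket-based audio input mentioned above can be illustrated with a generic sketch of how a server might consume audio incrementally from a connected socket. It does not reproduce the AhoSR audio-input module. The function name `stream_chunks`, the chunk size and the 16-bit little-endian PCM format are assumptions made for this example.

```python
import socket
import struct


def stream_chunks(conn, chunk_bytes=320):
    """Yield lists of 16-bit PCM samples as soon as enough bytes have arrived.

    conn:        a connected socket delivering raw little-endian PCM audio.
    chunk_bytes: size of each chunk handed to the recogniser (must be even).
    Bytes left over when the connection closes are discarded.
    """
    pending = b""
    while True:
        data = conn.recv(4096)
        if not data:  # the peer closed the connection
            break
        pending += data
        while len(pending) >= chunk_bytes:
            chunk, pending = pending[:chunk_bytes], pending[chunk_bytes:]
            yield list(struct.unpack("<%dh" % (chunk_bytes // 2), chunk))


# Local demonstration with a pair of connected sockets.
client, server = socket.socketpair()
client.sendall(struct.pack("<4h", 1, 2, 3, 4))
client.close()
for samples in stream_chunks(server, chunk_bytes=4):
    print(samples)
server.close()
```

Because each chunk is yielded as soon as it is complete, a decoder consuming this generator can start working before the whole utterance has been received.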
Adding some manually corrected data (15 % of the training set) gives results similar to those obtained with a single-entry dictionary. To take advantage of the manually corrected transcriptions, different ways of training HMMs have been analysed. We have found that slightly better HMMs are obtained by using data with few transcription errors in the initial stages of training and then using the whole database. To build the initial system, two GOP distributions were considered necessary to classify each phone: the distribution of correctly pronounced phones and the distribution of incorrectly pronounced ones. The GOPs of the incorrectly pronounced phones were obtained by simulating errors and computing the GOP scores by forced alignment in the AhoSR decoder. The thresholds between correctly and incorrectly uttered phones were then calculated as the Equal Error Rate (EER) point of the two distributions. This approach was implemented in an initial prototype, and several laboratory experiments produced very good results. The system was then tested in a more realistic environment: Basque language schools, with 20 students. The objective results, along with the survey completed by the 20 students who tested the system, were very promising. The initial prototype ran locally, and we felt the need to develop a more universal system that could be accessed from any device, anywhere. We therefore took advantage of the recent HTML5 standard, which lets the browser access the audio input through the audio API, regardless of the platform. This has allowed us to create a system accessible from any operating system. Moreover, for the WWSV-based SGP task, another HTML5 API has been used (the WebSocket API), which creates socket-like connections between the browser and the server in order to send audio data on the fly.
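The EER-based threshold described above can be sketched as a simple search over candidate thresholds. The sketch assumes that lower GOP values indicate better pronunciation, so a phone is accepted as correct when its GOP does not exceed the threshold. The function name and the sample values are hypothetical and are not taken from the thesis.

```python
def eer_threshold(correct_gops, incorrect_gops):
    """Return (threshold, error_rate) at the approximate Equal Error Rate point.

    correct_gops:   GOP scores of correctly pronounced phones.
    incorrect_gops: GOP scores of incorrectly pronounced (e.g. simulated) phones.
    Assumption: a phone is accepted as correct when gop <= threshold.
    """
    if not correct_gops or not incorrect_gops:
        raise ValueError("both score lists must be non-empty")
    best = None
    for t in sorted(set(correct_gops) | set(incorrect_gops)):
        # False rejection: correct phones scored above the threshold.
        frr = sum(g > t for g in correct_gops) / len(correct_gops)
        # False acceptance: incorrect phones scored at or below the threshold.
        far = sum(g <= t for g in incorrect_gops) / len(incorrect_gops)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, t, (frr + far) / 2)
    return best[1], best[2]


# Toy scores for one phone (invented for illustration only).
threshold, eer = eer_threshold([0.1, 0.2, 0.3, 0.4, 1.5],
                               [0.35, 1.0, 2.0, 2.5, 3.0])
print(threshold, eer)
```

With real data the two error curves rarely cross exactly at an observed score, which is why the sketch picks the candidate with the smallest gap between them.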
Several issues had to be addressed for the on-line implementation of the system. For example, because users will capture audio with different devices, some kind of parameter normalisation is needed. Furthermore, the normalisation technique must work on-line, since in WWSV continuous feedback must be provided before the whole signal has reached the recogniser. Different techniques have been tested to implement Cepstral Mean and Variance Normalisation (CMVN) and to estimate the initial values of the cepstral means and variances. The best results have been obtained with a hybrid approach proposed in this work, in which the initial means are estimated from the first N frames and the initial variances are obtained from the training datasets. In addition, a new CMVN technique has been devised in this thesis: Multi-Normalisation Scoring (MNS) based CMVN. MNS consists in generating multiple observation likelihood scores by normalising the incoming Mel-Frequency Cepstral Coefficients (MFCCs) using means and variances computed from different speech datasets recorded under different conditions. MNS-based CMVN consists in computing the probabilities that a frame belongs to the different training datasets; these probabilities are then used as weights to estimate the actual means and variances. The results obtained are remarkable, mainly for clean signals. The greatest advantage of MNS is that CMVN can be performed on-line, frame by frame, with no need to analyse the neighbouring frames or the frames of the segment to which the frame belongs. Using the same MNS method, a novel and effective on-line Voice Activity Detector (VAD) has also been devised. In a validation experiment comparing our MNS-based VAD with two ITU-T VAD algorithms (G.720.1 and G.729b), we obtained better overall results: the classification errors are considerably lower for non-speech frames and comparable for speech frames.
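The two normalisation ideas described above can be sketched as follows, under assumptions the abstract does not specify. In `OnlineCMVN`, the hybrid initialisation follows the text (means from the first N frames, variances from training data), but the exponential update with forgetting factor `alpha` is an assumption. In `mns_weighted_stats`, the per-dataset scores are taken as given, equal priors are assumed, and the linear blending of variances is a simplification. All names are hypothetical.

```python
import math


class OnlineCMVN:
    """Frame-by-frame CMVN with hybrid initialisation (sketch)."""

    def __init__(self, train_var, n_init=10, alpha=0.995):
        self.var = list(train_var)  # initial variances: taken from training data
        self.mean = None            # initial means: set after n_init frames
        self.n_init = n_init
        self.alpha = alpha          # forgetting factor (assumption of this sketch)
        self._buffer = []

    def _normalise(self, frame):
        return [(x - m) / math.sqrt(v)
                for x, m, v in zip(frame, self.mean, self.var)]

    def push(self, frame):
        """Return the normalised frames made available by this new frame."""
        frame = list(frame)
        if self.mean is None:
            # Hold back the first n_init frames until the means can be estimated.
            self._buffer.append(frame)
            if len(self._buffer) < self.n_init:
                return []
            self.mean = [sum(col) / self.n_init for col in zip(*self._buffer)]
            ready, self._buffer = self._buffer, []
            return [self._normalise(f) for f in ready]
        # Recursive update of the running statistics, then normalise.
        a = self.alpha
        self.mean = [a * m + (1 - a) * x for m, x in zip(self.mean, frame)]
        self.var = [a * v + (1 - a) * (x - m) ** 2
                    for v, x, m in zip(self.var, frame, self.mean)]
        return [self._normalise(frame)]


def mns_weighted_stats(logliks, dataset_means, dataset_vars):
    """Blend per-dataset statistics using posterior weights for one frame.

    logliks: one log-likelihood score per dataset for the current frame.
    Returns (weights, mean, var); weights are softmax posteriors (equal priors).
    """
    peak = max(logliks)  # subtract the maximum for numerical stability
    expd = [math.exp(s - peak) for s in logliks]
    total = sum(expd)
    weights = [e / total for e in expd]
    dim = len(dataset_means[0])
    mean = [sum(w * m[d] for w, m in zip(weights, dataset_means)) for d in range(dim)]
    var = [sum(w * v[d] for w, v in zip(weights, dataset_vars)) for d in range(dim)]
    return weights, mean, var
```

Note that `mns_weighted_stats` needs only the current frame's scores, which is what allows the normalisation to run frame by frame, whereas `OnlineCMVN` introduces a delay of `n_init` frames before the first output.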
These results make our MNS-based VAD suitable for systems that require low error rates on both speech and non-speech frames. Finally, Neural Networks have been used to assess the impact of different parameters when training a classifier. We have seen that GOP scores are the most effective parameters, compared with the durations and log-likelihoods of the previous, current and following phones. The results of these experiments are consistent with those obtained with the initial system.