Voice personalization and speaker de-identification in speech processing systems
- Magariños Iglesias, María del Carmen (Author)
- Daniel Erro Eslava (Director)
- Eduardo Rodríguez Banga (Director)
Defence university: Universidade de Vigo
Date of defence: 23 May 2019
- Juan Manuel Montero Martínez (Chair)
- Eva Navas Cordón (Secretary)
- Yannis Stylianou (Committee member)
Type: Thesis
Abstract
In recent years, we have witnessed the growth of speech processing systems based on statistical techniques, which combine acoustic modelling and vocoding. The main advantage of these systems is the flexibility they provide for modifying speech and model characteristics by applying transformation or adaptation techniques. In the specific case of speech synthesis, a broad range of transformation techniques has been explored and applied to both text-to-speech (TTS) and voice conversion (VC) systems. However, despite the great strides that have been made, transformation possibilities have not yet been fully exploited. In particular, we have identified two applications that remain underdeveloped. On the one hand, even though TTS systems have reached a high level of maturity and significant advances have been made in speaker adaptation, adaptation across languages is still a challenge. On the other hand, although VC techniques have been widely studied and used in a variety of applications, speaker de-identification is emerging as a new research line to which VC applies directly. These two problems, cross-lingual speaker adaptation and speaker de-identification, are addressed in this thesis.

The cross-lingual challenge is tackled within the statistical parametric speech synthesis (SPSS) framework. More specifically, we propose a novel technique for cross-lingual speaker adaptation in hidden Markov model (HMM)-based speech synthesis. The proposed method, which operates at the segmental level, allows a given speaker to be rapidly cloned in a different language without requiring any phonetic or linguistic information. Subjective and objective experiments, using different languages and a substantial number of speakers, prove the validity of our proposal, which we refer to as “language-independent acoustic cloning”.
Experimental results show that our method captures target speakers’ identity nearly as well as standard intra-lingual adaptation methods, while clearly improving on a state-of-the-art cross-lingual adaptation method used as a baseline.

Speaker de-identification, in turn, is approached in this thesis as a special application of VC techniques in which some specific requirements must be considered. Two speaker de-identification strategies are presented, both relying on frequency warping (FW) functions combined with spectral amplitude scaling (AS), which address shortcomings of the speaker de-identification methods found in the literature. One of the proposed approaches employs a pool of pre-trained transformation functions together with similarity/dissimilarity criteria to select the most appropriate transform for a given input speaker. The other strategy, by contrast, is based on an adjustable parametric definition of the transformation functions and does not require any training. Perceptual listening tests and objective experiments confirm that the proposed strategies succeed in hiding the identity of the given speakers, while preserving most of the naturalness of the original speech for the vast majority of the applied transformations.

To conclude, we study the influence of speaker de-identification on depression detection using automatic tools. This study aims to analyse the impact of speaker de-identification procedures on speech features beyond speaker identity, and it also provides a more realistic scenario for assessing the performance of the proposed de-identification strategies. Experimental results show that both strategies achieve satisfactory de-identification accuracies even under uncontrolled conditions, while only a slight degradation is observed in depression detection when using de-identified speech.
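The general idea behind FW-plus-AS de-identification can be illustrated with a minimal sketch: a warping function displaces the frequency axis of a spectral envelope (shifting formant positions away from the source speaker's), and an amplitude-scaling term then corrects the log-spectral amplitudes. The bilinear warping function and all parameter values below are illustrative assumptions for exposition, not the exact formulation used in the thesis.

```python
import numpy as np

def bilinear_warp(freqs, alpha):
    """Map normalized frequencies in [0, 1] through a bilinear warping
    function with parameter alpha (|alpha| < 1). A nonzero alpha shifts
    formant positions, which is the core of FW-based de-identification.
    (Illustrative choice of warping family, not the thesis's own.)"""
    omega = np.pi * freqs
    warped = omega + 2.0 * np.arctan2(alpha * np.sin(omega),
                                      1.0 - alpha * np.cos(omega))
    return warped / np.pi

def deidentify_frame(envelope, alpha=0.2, scale_db=None):
    """Apply frequency warping (FW) followed by amplitude scaling (AS)
    to one log-spectral envelope frame (values in dB)."""
    n = len(envelope)
    freqs = np.linspace(0.0, 1.0, n)
    # FW: resample the original envelope at the warped frequency positions
    warped_env = np.interp(bilinear_warp(freqs, alpha), freqs, envelope)
    # AS: add a (possibly frequency-dependent) correction in the log domain
    if scale_db is None:
        scale_db = np.zeros(n)
    return warped_env + scale_db

# Toy frame: a smooth envelope with a single "formant" peak at 0.3
freqs = np.linspace(0.0, 1.0, 64)
frame = -20.0 + 15.0 * np.exp(-((freqs - 0.3) ** 2) / 0.01)
out = deidentify_frame(frame, alpha=0.2)
print(out.shape)
```

Because the bilinear warp maps the normalized band [0, 1] onto itself monotonically, the transformed envelope remains a valid envelope (no frequency content is folded or lost), which is one reason such transforms can hide identity while largely preserving naturalness.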