Development of a Text-to-Speech model for the Awajún language and its automatic evaluation with ASR
Publisher
Pontificia Universidad Católica del Perú
Full-text access is restricted to the PUCP community
Abstract
This research aims to develop a Text-to-Speech (TTS) model for Awajún, one of the 48
indigenous languages of Peru, in order to contribute to its preservation through a speech
synthesis model based on deep learning. The model was built with the Tacotron 2 and
HiFi-GAN architectures, both widely used for generating high-quality speech. The
methodology covered the collection, cleaning, and alignment of a dataset of audio
recordings and texts in Awajún, obtained from the Scripture Earth and Ojo Público
platforms. The data were then used to train several TTS models, which generated audio
samples from written text.
The TTS models were evaluated with the Character Error Rate (CER) metric, computed with
the help of an Automatic Speech Recognition (ASR) model. The results identified the
best-performing model, which succeeded in generating speech in Awajún and demonstrates
the potential of neural networks for processing low-resource languages. Finally, the
Mean Opinion Score (MOS) metric was applied: native speakers rated the naturalness of
the audio generated by the best model identified. This work is a significant contribution
to the preservation of the Awajún language and opens the way for future research on
technological tools for Awajún and other indigenous languages of Peru.
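The two metrics named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `edit_distance`, `cer`, and `mean_opinion_score` are hypothetical, the ASR transcript is assumed to be already available as a plain string, and the MOS ratings are assumed to be numeric scores (a 1 to 5 scale is common, but the abstract does not state the scale used). The sample strings are placeholders, not Awajún text.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning the first i reference characters into "" costs i
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # drop a reference character
                            curr[j - 1] + 1,      # extra character in hypothesis
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def mean_opinion_score(ratings: list[float]) -> float:
    """MOS as the arithmetic mean of listener ratings (scale is an assumption)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(ratings) / len(ratings)


if __name__ == "__main__":
    # Placeholder strings: the text given to the TTS model versus the
    # transcript an ASR model (hypothetically) produced from the audio.
    print(cer("abcde", "abxde"))
    # Placeholder listener ratings.
    print(mean_opinion_score([4, 5, 3, 4]))
```

In this setup a lower CER means the ASR model recovered the input text more faithfully from the synthesized audio, which is why it can serve as an automatic proxy for intelligibility, while MOS captures the listeners' judgment of naturalness.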
Keywords
Deep learning (Machine learning), Aguaruna, Indigenous languages--Peru--Amazon Region, Natural language processing (Computer science)
Creative Commons License
Except where otherwise noted, this item's license is described as info:eu-repo/semantics/openAccess
