Quechua synthesizers using deep learning techniques
Publisher
Pontificia Universidad Católica del Perú
Full-text access is restricted to the PUCP community
Abstract
This research addresses the loss of the cultural and linguistic heritage of Quechua, a language that UNESCO classifies as endangered. We focus on the Chanca variety of Quechua (Chanca runasimi), spoken in Andahuaylas, Ayacucho, and Huancavelica by approximately one million people.
The objective of this work is to implement four Quechua text-to-speech synthesizers (Tacotron 2 + WaveNet, FastPitch + WaveGlow, VITS, and XTTS v2) through complementary pre-training and fine-tuning on a dataset of 2,300 high-quality recordings. The objective evaluation measures prosodic and spectral quality with F0 RMSE and SECS, while the subjective evaluation measures naturalness and similarity with MOS-N and MOS-S, respectively. The aim is to provide open resources that support linguistic revitalization and reduce technological barriers in Quechua-speaking communities.
The results show that the end-to-end XTTS v2 model offers greater naturalness and similarity to the original voice. Fundamental-frequency differences of up to 12 Hz on average and SECS values of around 0.98 were observed, indicating accurate and consistent models. The MOS-N and MOS-S scores reached 4.9 and 4.8, respectively, showing strong acceptance among the Quechua-speaking evaluators.
Finally, this approach seeks to contribute to the preservation of Quechua through technological resources that promote its use and dissemination in educational, cultural, and accessibility contexts. The synthesizers developed here facilitate the creation of teaching materials and digital content in Quechua, strengthening the language's presence in Peruvian society and its integration into contemporary technology platforms.
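The abstract names three evaluation metrics without defining them. As a rough illustration only, the sketch below shows how such metrics are commonly computed; it is not the authors' evaluation code. It assumes that F0 contours are already time-aligned, equal in length, and use 0 to mark unvoiced frames, that SECS is the cosine similarity between two speaker-encoder embeddings, and that a MOS is the plain mean of listener ratings on a 1 to 5 scale. All function names are hypothetical.

```python
import math


def f0_rmse(ref_hz, syn_hz):
    """RMSE in Hz between two F0 contours.

    Assumption: both contours are time-aligned, and a value of 0 marks an
    unvoiced frame. Only frames voiced in both contours are compared.
    """
    pairs = [(r, s) for r, s in zip(ref_hz, syn_hz) if r > 0 and s > 0]
    if not pairs:
        raise ValueError("no frames are voiced in both contours")
    return math.sqrt(sum((r - s) ** 2 for r, s in pairs) / len(pairs))


def secs(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings (1.0 = same direction).

    Assumption: the embeddings come from some speaker encoder; which encoder
    the thesis used is not stated in the abstract.
    """
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


def mean_opinion_score(ratings):
    """Plain average of listener ratings (assumed to be on a 1 to 5 scale)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(ratings) / len(ratings)
```

Under these assumptions, a lower F0 RMSE means the synthesized pitch contour tracks the reference more closely, and a SECS near 1 means the synthesized voice is close to the reference speaker in embedding space.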
Keywords
Deep learning (Machine learning), Speech processing systems, Quechua--Dialects--Peru, Cultural heritage--Conservation--Peru
Creative Commons License
Except where otherwise noted, the license for this item is described as https://purl.org/coar/access_right/c_f1cf
