Emotion-Aware Speech Synthesis using Multimodal Deep Learning with Visual and Textual Cues
| dc.contributor.author | Totlani, Ketan | |
| dc.contributor.author | Patil, Smital | |
| dc.contributor.author | Sasikumar, Abhijai | |
| dc.contributor.author | Moreira, Fernando | |
| dc.contributor.author | Mohanty, Sachi Nandan | |
| dc.date.accessioned | 2025-11-14T11:07:30Z | |
| dc.date.available | 2025-11-14T11:07:30Z | |
| dc.date.issued | 2025-11-11 | |
| dc.description.abstract | Contemporary Text-to-Speech (TTS) systems produce highly intelligible and accurate speech, yet the output is often emotionless and robotic because synthesizing emotion remains a challenge. This shortcoming matters most in human-centered applications such as virtual assistants, healthcare aides, and immersive voice technologies, where emotionally intelligent dialogue improves the user experience. The research presented here designs a multimodal, emotion-aware deep learning framework for expressive speech synthesis. The study uses the RAVDESS emotional speech dataset and combines two models: Tacotron 2, a sequence-to-sequence model for mel-spectrogram generation, and a Prosody-Guided Conditional GAN (cGAN) that improves emotional prosody by refining pitch and energy. Experimental evaluation with Mean Opinion Score (MOS), Mel Cepstral Distortion (MCD), and F0 RMSE shows that the system generates speech that is both highly natural (MOS: 4.32) and emotionally aligned (Emotion MOS: 4.15). The results support the hypothesis that combining prosodic conditioning with spectrogram synthesis is effective, substantially improving the generation of emotion-laden speech and opening new possibilities for next-generation AI communication systems. | |
| dc.identifier.citation | Totlani, K., Patil, S., Sasikumar, A., Moreira, F., & Mohanty, S. N. (2025). Emotion-Aware Speech Synthesis using Multimodal Deep Learning with Visual and Textual Cues. In 2025 IEEE 8th International Conference on Multimedia Information Processing and Retrieval (MIPR), San Jose, CA, USA, 06-08 August 2025, (pp. 104-108). IEEE. https://doi.org/10.1109/MIPR67560.2025.00025. Repositório Institucional UPT. https://hdl.handle.net/11328/6769 | |
| dc.identifier.isbn | 979-8-3315-9465-7 | |
| dc.identifier.isbn | 979-8-3315-9466-4 | |
| dc.identifier.uri | https://hdl.handle.net/11328/6769 | |
| dc.language.iso | eng | |
| dc.publisher | IEEE | |
| dc.relation.hasversion | https://doi.org/10.1109/MIPR67560.2025.00025 | |
| dc.rights | restricted access | |
| dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | |
| dc.subject | Emotion-Aware Speech Synthesis | |
| dc.subject | Tacotron 2 | |
| dc.subject | Prosody-Guided Conditional GAN | |
| dc.subject | Multimodal Deep Learning | |
| dc.subject | RAVDESS Dataset | |
| dc.subject | Emotional Prosody Modeling | |
| dc.subject | Text-to-Speech (TTS) | |
| dc.subject | Mel Spectrogram Generation | |
| dc.subject.fos | Natural Sciences - Computer and Information Sciences | |
| dc.title | Emotion-Aware Speech Synthesis using Multimodal Deep Learning with Visual and Textual Cues | |
| dc.type | conference paper | |
| dcterms.references | https://ieeexplore.ieee.org/document/11225978/authors#full-text-header | |
| dspace.entity.type | Publication | |
| oaire.citation.conferenceDate | 2025-08-06 | |
| oaire.citation.conferencePlace | San Jose, CA, USA | |
| oaire.citation.endPage | 108 | |
| oaire.citation.startPage | 104 | |
| oaire.citation.title | 2025 IEEE 8th International Conference on Multimedia Information Processing and Retrieval (MIPR) | |
| oaire.version | http://purl.org/coar/version/c_970fb48d4fbd8a85 | |
| person.affiliation.name | Universidade Portucalense | |
| person.familyName | Moreira | |
| person.givenName | Fernando | |
| person.identifier.ciencia-id | 7B1C-3A29-9861 | |
| person.identifier.orcid | 0000-0002-0816-1445 | |
| person.identifier.rid | P-9673-2016 | |
| person.identifier.scopus-author-id | 8649758400 | |
| relation.isAuthorOfPublication | bad3408c-ee33-431e-b9a6-cb778048975e | |
| relation.isAuthorOfPublication.latestForDiscovery | bad3408c-ee33-431e-b9a6-cb778048975e |
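
The abstract above reports Mel Cepstral Distortion (MCD) and F0 RMSE as objective evaluation measures. For reference only, the commonly used definitions of these metrics are sketched below; the paper may use a different variant (e.g., a different number of cepstral dimensions D or a different frame-alignment scheme), so the constants and symbols here are assumptions rather than the authors' exact formulation.

```latex
% Assumed standard definitions (not taken from the paper):
% c_t(d), \hat{c}_t(d): d-th mel-cepstral coefficient of the reference and
% synthesized speech at frame t; D: number of coefficients; T: number of
% (time-aligned) frames.
\[
\mathrm{MCD} \;=\; \frac{10}{\ln 10}\,\frac{1}{T}\sum_{t=1}^{T}
\sqrt{\,2\sum_{d=1}^{D}\bigl(c_t(d)-\hat{c}_t(d)\bigr)^{2}}
\]
% F0 RMSE, typically computed over voiced frames, with F_{0,t} and
% \hat{F}_{0,t} the reference and synthesized fundamental frequencies (Hz):
\[
\mathrm{F0\ RMSE} \;=\; \sqrt{\frac{1}{T}\sum_{t=1}^{T}
\bigl(F_{0,t}-\hat{F}_{0,t}\bigr)^{2}}
\]
```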