Emotion-Aware Speech Synthesis using Multimodal Deep Learning with Visual and Textual Cues
| dc.contributor.author | Totlani, Ketan | |
| dc.contributor.author | Patil, Smital | |
| dc.contributor.author | Sasikumar, Abhijai | |
| dc.contributor.author | Moreira, Fernando | |
| dc.contributor.author | Mohanty, Sachi Nandan | |
| dc.date.accessioned | 2025-11-14T11:07:30Z | |
| dc.date.available | 2025-11-14T11:07:30Z | |
| dc.date.issued | 2025-11-11 | |
| dc.description.abstract | Contemporary Text-to-Speech (TTS) systems produce highly intelligible and accurate speech, yet the output is often emotionless and robotic because synthesizing emotion remains a challenge. This shortcoming matters most in human-centered applications such as virtual assistants, healthcare aides, and immersive voice technologies, where emotionally intelligent dialogue improves the user experience. The research presented here designs a multimodal, emotion-aware deep learning framework for expressive speech synthesis. The study uses the RAVDESS emotional speech dataset and combines two models: Tacotron 2, a sequence-to-sequence model for mel-spectrogram generation, and a Prosody-Guided Conditional GAN (cGAN) that improves emotional prosody by refining pitch and energy. Experimental evaluation with Mean Opinion Score (MOS), Mel Cepstral Distortion (MCD), and F0 RMSE shows that the system generates speech that is both highly natural (MOS: 4.32) and emotionally aligned (Emotion MOS: 4.15). The results support the hypothesis that combining prosodic conditioning with spectrogram synthesis is effective, substantially improving the generation of emotion-laden speech and opening new possibilities for next-generation AI communication systems. | |
| dc.identifier.citation | Totlani, K., Patil, S., Sasikumar, A., Moreira, F., & Mohanty, S. N. (2025). Emotion-Aware Speech Synthesis using Multimodal Deep Learning with Visual and Textual Cues. In 2025 IEEE 8th International Conference on Multimedia Information Processing and Retrieval (MIPR), San Jose, CA, USA, 06-08 August 2025, (pp. 104-108). IEEE. https://doi.org/10.1109/MIPR67560.2025.00025. Repositório Institucional UPT. https://hdl.handle.net/11328/6769 | |
| dc.identifier.isbn | 979-8-3315-9465-7 | |
| dc.identifier.isbn | 979-8-3315-9466-4 | |
| dc.identifier.uri | https://hdl.handle.net/11328/6769 | |
| dc.language.iso | eng | |
| dc.publisher | IEEE | |
| dc.relation.hasversion | https://doi.org/10.1109/MIPR67560.2025.00025 | |
| dc.rights | restricted access | |
| dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | |
| dc.subject | Emotion-Aware Speech Synthesis | |
| dc.subject | Tacotron 2 | |
| dc.subject | Prosody-Guided Conditional GAN | |
| dc.subject | Multimodal Deep Learning | |
| dc.subject | RAVDESS Dataset | |
| dc.subject | Emotional Prosody Modeling | |
| dc.subject | Text-to-Speech (TTS) | |
| dc.subject | Mel Spectrogram Generation | |
| dc.subject.fos | Natural Sciences - Computer and Information Sciences | |
| dc.title | Emotion-Aware Speech Synthesis using Multimodal Deep Learning with Visual and Textual Cues | |
| dc.type | conference paper | |
| dcterms.references | https://ieeexplore.ieee.org/document/11225978/authors#full-text-header | |
| dspace.entity.type | Publication | |
| oaire.citation.conferenceDate | 2025-08-06 | |
| oaire.citation.conferencePlace | San Jose, CA, USA | |
| oaire.citation.endPage | 108 | |
| oaire.citation.startPage | 104 | |
| oaire.citation.title | 2025 IEEE 8th International Conference on Multimedia Information Processing and Retrieval (MIPR) | |
| oaire.version | http://purl.org/coar/version/c_970fb48d4fbd8a85 | |
| person.affiliation.name | Universidade Portucalense | |
| person.familyName | Moreira | |
| person.givenName | Fernando | |
| person.identifier.ciencia-id | 7B1C-3A29-9861 | |
| person.identifier.orcid | 0000-0002-0816-1445 | |
| person.identifier.rid | P-9673-2016 | |
| person.identifier.scopus-author-id | 8649758400 | |
| relation.isAuthorOfPublication | bad3408c-ee33-431e-b9a6-cb778048975e | |
| relation.isAuthorOfPublication.latestForDiscovery | bad3408c-ee33-431e-b9a6-cb778048975e |
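
The abstract above reports Mel Cepstral Distortion (MCD) and F0 RMSE as objective evaluation measures. For reference only, the commonly used definitions of these metrics are sketched below; the paper may use a different variant (e.g., a different number of cepstral dimensions D or a different frame-alignment scheme), so the constants and symbols here are assumptions rather than the authors' exact formulation.

```latex
% Assumed standard definitions (not taken from the paper):
% c_t(d), \hat{c}_t(d): d-th mel-cepstral coefficient of the reference and
% synthesized speech at frame t; D: number of coefficients; T: number of
% (time-aligned) frames.
\[
\mathrm{MCD} \;=\; \frac{10}{\ln 10}\,\frac{1}{T}\sum_{t=1}^{T}
\sqrt{\,2\sum_{d=1}^{D}\bigl(c_t(d)-\hat{c}_t(d)\bigr)^{2}}
\]
% F0 RMSE, typically computed over voiced frames, with F_{0,t} and
% \hat{F}_{0,t} the reference and synthesized fundamental frequencies (Hz):
\[
\mathrm{F0\ RMSE} \;=\; \sqrt{\frac{1}{T}\sum_{t=1}^{T}
\bigl(F_{0,t}-\hat{F}_{0,t}\bigr)^{2}}
\]
```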