Synthetic Data Generation for Binary and Multi-Class Classification in the Health Domain

dc.contributor.author: Guerreiro, Camila
dc.contributor.author: Leal, Fátima
dc.contributor.author: Pinho, Micaela
dc.date.accessioned: 2025-11-17T12:24:48Z
dc.date.available: 2025-11-17T12:24:48Z
dc.date.issued: 2025-11-14
dc.description.abstract: The growing demand for data-driven solutions in healthcare is often hindered by limited access to high-quality datasets due to privacy concerns, data imbalance, and regulatory constraints. Synthetic data generation has emerged as a promising strategy to address these challenges by creating artificial yet statistically valid datasets that preserve the underlying patterns of real data without compromising patient confidentiality. This study explores methodologies for generating synthetic data tailored to binary and multi-class classification problems within the health domain. We employ advanced techniques such as probabilistic modelling, generative adversarial networks, and data augmentation strategies to replicate realistic feature distributions and class relationships. A comprehensive evaluation is conducted using benchmark healthcare datasets, measuring fidelity, diversity, and utility of the synthetic data in downstream predictive modelling tasks. The original dataset consisted of 2125 imbalanced cases, both in the binary and multi-class classification scenarios. Experimental results demonstrate that models trained on synthetic datasets achieve performance levels comparable to those trained on real data, particularly in scenarios with severe class imbalance. The findings underscore the potential of synthetic data as a privacy-preserving enabler for robust machine learning applications in healthcare, facilitating innovation while adhering to strict data protection regulations.
dc.identifier.citation: Guerreiro, C., Leal, F., & Pinho, M. (2025). Synthetic Data Generation for Binary and Multi-Class Classification in the Health Domain. Information, 16(11), 986, 1-21. https://doi.org/10.3390/info16110986. Repositório Institucional UPT. https://hdl.handle.net/11328/6773
dc.identifier.issn: 2078-2489
dc.identifier.uri: https://hdl.handle.net/11328/6773
dc.language.iso: eng
dc.publisher: MDPI - Multidisciplinary Digital Publishing Institute
dc.relation.hasversion: https://doi.org/10.3390/info16110986
dc.rights: open access
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: synthetic data
dc.subject: binary
dc.subject: multi-class
dc.subject: classification
dc.subject: health
dc.subject: data balancing
dc.subject.fos: Ciências Sociais - Economia e Gestão
dc.title: Synthetic Data Generation for Binary and Multi-Class Classification in the Health Domain
dc.type: journal article
dcterms.references: https://www.mdpi.com/2078-2489/16/11/986
dspace.entity.type: Publication
oaire.citation.endPage: 21
oaire.citation.issue: 11
oaire.citation.startPage: 1
oaire.citation.title: Information
oaire.citation.volume: 16
oaire.version: http://purl.org/coar/version/c_970fb48d4fbd8a85
person.affiliation.name: REMIT – Research on Economics, Management and Information Technologies
person.affiliation.name: REMIT – Research on Economics, Management and Information Technologies
person.familyName: Leal
person.familyName: Pinho
person.givenName: Fátima
person.givenName: Micaela
person.identifier.ciencia-id: 2211-3EC7-B4B6
person.identifier.ciencia-id: AF14-3E2F-3400
person.identifier.orcid: 0000-0003-4418-2590
person.identifier.orcid: 0000-0003-2021-9141
person.identifier.rid: Y-3460-2019
person.identifier.rid: L-1789-2018
person.identifier.scopus-author-id: 57190765181
person.identifier.scopus-author-id: 23990998900
relation.isAuthorOfPublication: 8066078f-1e30-4b0a-aa84-3b6a2af4185c
relation.isAuthorOfPublication: b73425ae-9c53-43ec-9bef-8d0ebebecc6b
relation.isAuthorOfPublication.latestForDiscovery: 8066078f-1e30-4b0a-aa84-3b6a2af4185c

Files

Original bundle

Name: Guerreiro C.; Leal F. Pinho M. (2025) - 14-11-2025.pdf
Size: 310.16 KB
Format: Adobe Portable Document Format