Synthetic Data Generation for Binary and Multi-Class Classification in the Health Domain

Date

2025-11-14

Publisher

MDPI - Multidisciplinary Digital Publishing Institute
Language
English

Abstract

The growing demand for data-driven solutions in healthcare is often hindered by limited access to high-quality datasets due to privacy concerns, data imbalance, and regulatory constraints. Synthetic data generation has emerged as a promising strategy to address these challenges by creating artificial yet statistically valid datasets that preserve the underlying patterns of real data without compromising patient confidentiality. This study explores methodologies for generating synthetic data tailored to binary and multi-class classification problems within the health domain. We employ advanced techniques such as probabilistic modelling, generative adversarial networks, and data augmentation strategies to replicate realistic feature distributions and class relationships. A comprehensive evaluation is conducted using benchmark healthcare datasets, measuring fidelity, diversity, and utility of the synthetic data in downstream predictive modelling tasks. The original dataset comprised 2125 cases with imbalanced classes in both the binary and multi-class classification scenarios. Experimental results demonstrate that models trained on synthetic datasets achieve performance levels comparable to those trained on real data, particularly in scenarios with severe class imbalance. The findings underscore the potential of synthetic data as a privacy-preserving enabler for robust machine learning applications in healthcare, facilitating innovation while adhering to strict data protection regulations.

Keywords

synthetic data, binary, multi-class, classification, health, data balancing

Document Type

Article

Citation

Guerreiro, C., Leal, F., & Pinho, M. (2025). Synthetic Data Generation for Binary and Multi-Class Classification in the Health Domain. Information, 16(11), 986, 1-21. https://doi.org/10.3390/info16110986. Repositório Institucional UPT. https://hdl.handle.net/11328/6773

Access Type

Open Access
