Alex X. Wang and [Binh P. Nguyen](https://people.wgtn.ac.nz/b.nguyen) ∗
Tabular data generation is a complex task due to its distinctive characteristics and inherent complexities. While Variational Autoencoders (VAEs) have been adapted from the computer vision domain for tabular data synthesis, their reliance on non-deterministic latent space regularization introduces limitations. The stochastic nature of VAEs can contribute to collapsed posteriors, yielding suboptimal outcomes, limiting control over the latent space, and constraining latent space interpolation. To address these challenges, we present the Tabular Wasserstein Autoencoder (TWAE), which leverages the deterministic encoding mechanism of Wasserstein Autoencoders (WAEs). Deterministic encoding maps each input to a fixed latent code, enhancing the stability and expressiveness of the model's latent space. This, in turn, enables seamless integration of shallow interpolation mechanisms, such as the synthetic minority over-sampling technique (SMOTE), into deep-learning-based data generation. Specifically, TWAE is trained once to establish a low-dimensional representation of the real data, and various latent interpolation methods then efficiently generate synthetic latent points, achieving a balance between accuracy and efficiency. Extensive experiments consistently demonstrate TWAE's superiority and its versatility across diverse feature types and dataset sizes. By combining WAE principles with shallow interpolation, TWAE effectively leverages SMOTE's advantages, establishing it as a robust solution for complex tabular data synthesis.
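To make the idea concrete, here is a minimal, self-contained PyTorch sketch of a deterministic autoencoder regularized with an MMD penalty (the WAE-MMD variant), followed by SMOTE-style interpolation between latent codes. The architecture, kernel, and hyperparameters are illustrative only, not the paper's exact configuration:

```python
# Minimal sketch of the TWAE idea: a deterministic autoencoder regularized
# with an RBF-kernel MMD penalty (the WAE-MMD variant), followed by
# SMOTE-style interpolation between latent codes. Illustrative only.
import torch
import torch.nn as nn

def rbf_mmd(x, y, sigma=1.0):
    # Biased (V-statistic) MMD estimate with an RBF kernel.
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

class DeterministicAE(nn.Module):
    def __init__(self, d_in, d_lat=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_lat))
        self.dec = nn.Sequential(nn.Linear(d_lat, 64), nn.ReLU(), nn.Linear(64, d_in))

    def forward(self, x):
        z = self.enc(x)  # deterministic: no reparameterization trick
        return z, self.dec(z)

def train_step(model, opt, x, lam=1.0):
    z, x_hat = model(x)
    prior = torch.randn_like(z)  # match encoded codes to a N(0, I) prior
    loss = ((x - x_hat) ** 2).mean() + lam * rbf_mmd(z, prior)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def smote_in_latent(z, n_new, k=5):
    # New point on the segment between a latent code and a random neighbor.
    idx = torch.randint(0, z.size(0), (n_new,))
    d = torch.cdist(z[idx], z)
    nbrs = d.topk(k + 1, largest=False).indices[:, 1:]  # drop self (distance 0)
    pick = nbrs[torch.arange(n_new), torch.randint(0, k, (n_new,))]
    lam = torch.rand(n_new, 1)
    return z[idx] + lam * (z[pick] - z[idx])

if __name__ == "__main__":
    x = torch.randn(256, 10)  # stand-in for preprocessed tabular features
    model = DeterministicAE(d_in=10)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(200):
        train_step(model, opt, x)
    with torch.no_grad():
        z, _ = model(x)
        x_syn = model.dec(smote_in_latent(z, n_new=128))  # decode new codes
    print(x_syn.shape)  # torch.Size([128, 10])
```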
Python version: 3.10
pip install -r requirements.txt
Each dataset is described by a profile file, data_profile/dataname.json (a hypothetical example is sketched below).
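For illustration, a profile for the adult dataset might look like the following. The field names here are hypothetical, not the repository's actual schema; consult the JSON files shipped under data_profile/ for the real format:

```python
# Writes a hypothetical dataset profile. Field names are illustrative only;
# see the JSON files under data_profile/ for the repository's actual schema.
import json

profile = {
    "name": "adult",
    "task_type": "binclass",
    "target_column": "income",
    "num_columns": ["age", "hours-per-week"],   # numerical features
    "cat_columns": ["workclass", "education"],  # categorical features
}

with open("data_profile/adult.json", "w") as f:
    json.dump(profile, f, indent=2)
```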
python s1_download_dataset.py
python s2_process_dataset.py
For non-deep models (SMOTE and its variants, Synthpop, and Copula) and traditional deep generative models (CTGAN, TVAE, and CopulaGAN), use the following command:
python main.py --dataname [NAME_OF_DATASET] --method [NAME_OF_METHODS] --mode train
Options for [NAME_OF_DATASET] in the sample code: adult, abalone
Options for [NAME_OF_METHODS]: smote, synthpop, copula, ctgan, tvae, copulagan
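For reference, the shallow SMOTE baseline reduces to interpolating between a sample and one of its nearest minority-class neighbors. A standalone illustration using the imbalanced-learn package (the underlying technique, not this repository's main.py wrapper):

```python
# Standalone illustration of the SMOTE baseline with imbalanced-learn;
# this shows the underlying technique, not this repository's main.py wrapper.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# A toy imbalanced tabular dataset standing in for e.g. adult or abalone.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))  # roughly 900 vs 100

# SMOTE synthesizes minority samples by interpolating between a minority
# point and one of its k nearest minority-class neighbors.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))  # classes balanced
```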
For more advanced deep generative models, we break the process down into model training and data sampling:
python main.py --dataname [NAME_OF_DATASET] --method [NAME_OF_METHODS] --mode train
python main.py --dataname [NAME_OF_DATASET] --method [NAME_OF_METHODS] --mode sample
Options for [NAME_OF_DATASET] in the sample code: adult, abalone
Options for [NAME_OF_METHODS]: ctabgan, tabddpm, twae
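To run the two-stage pipeline over several datasets and methods, the CLI calls above can also be scripted, for example as in this convenience sketch (it uses only the flags documented here):

```python
# Convenience sketch for scripting the two-stage pipeline across datasets
# and methods; it only uses the CLI flags documented above.
import subprocess

for dataname in ["adult", "abalone"]:
    for method in ["ctabgan", "tabddpm", "twae"]:
        for mode in ["train", "sample"]:  # always train before sampling
            subprocess.run(
                ["python", "main.py", "--dataname", dataname,
                 "--method", method, "--mode", mode],
                check=True,
            )
```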
This project is licensed under the Apache-2.0 License.
We would appreciate a citation if you find this repository useful in your research!
@article{wang2025twae,
  title={{Deterministic Autoencoder using Wasserstein Loss for Tabular Data Generation}},
  author={Alex X. Wang and Binh P. Nguyen},
  journal={Neural Networks},
  year={2025},
  publisher={Elsevier}
}