When we think of generative AI, images, text, and audio come to mind. But what if neural networks could imagine structured data tables, the kind analysts and data scientists use every day?
While generating cat images or realistic human faces is a solved problem, tabular data presents a unique challenge. Each column has its own meaning, scale, and statistical distribution. Relationships can be linear or nonlinear, categorical or numeric, and they often carry business logic (income should not be negative).
This article explores how a Variational Autoencoder (VAE) learns to “dream” in structured data: capturing its patterns, correlations, and variability, and generating new, realistic samples that feel authentic but reveal no real individuals.
Unlike images (which are spatially coherent) or text (which is sequential), tables are heterogeneous and discontinuous.
| Challenge | Description |
|---|---|
| Feature independence | Each column represents a different statistical distribution. |
| Mixed data types | Categorical + numeric + ordinal variables require special handling. |
| Non-spatial structure | There’s no concept of “neighboring pixels” or sequential context. |
| Correlations | Real-world dependencies between columns (higher income → higher spending) must be preserved. |
The lack of natural geometry makes tabular synthesis one of the most complex tasks in generative modeling. That’s where VAEs offer a compelling solution.
A Variational Autoencoder (VAE) is a generative neural network architecture that learns to compress data into a latent representation and reconstruct it, with a twist.
Instead of learning deterministic compression, it learns a probability distribution over latent features.
X → [Encoder] → μ, σ → z ~ N(μ, σ) → [Decoder] → X̂
- The encoder takes input data
Xand outputs two vectors:μ(mean)σ(variance)
describing the distribution of the latent variablez.
- The latent space samples
zusing the reparameterization trick:
- The decoder reconstructs the input from
z.
- The first term ensures the output looks like the input.
- The second term ensures the latent space follows a smooth Gaussian prior, so we can sample from it later.
This balance allows VAEs to generate new samples that look statistically consistent but not identical to real ones.
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pandas as pd
df = pd.read_csv("data/real_data.csv")
num_cols = ["age", "income", "score"]
cat_cols = ["segment", "region"]
scaler = StandardScaler()
encoder = OneHotEncoder(sparse_output=False)
X_num = scaler.fit_transform(df[num_cols])
X_cat = encoder.fit_transform(df[cat_cols])
X = np.concatenate([X_num, X_cat], axis=1)import torch
from torch import nn
class VAE(nn.Module):
def __init__(self, input_dim, latent_dim=8, hidden_dim=64):
super().__init__()
self.encoder = nn.Sequential(
nn.Linear(input_dim, hidden_dim), nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim), nn.ReLU()
)
self.mu = nn.Linear(hidden_dim, latent_dim)
self.logvar = nn.Linear(hidden_dim, latent_dim)
self.decoder = nn.Sequential(
nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
nn.Linear(hidden_dim, input_dim)
)
def reparameterize(self, mu, logvar):
std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)
return mu + eps * std
def forward(self, x):
h = self.encoder(x)
mu, logvar = self.mu(h), self.logvar(h)
z = self.reparameterize(mu, logvar)
return self.decoder(z), mu, logvardef vae_loss(x, x_hat, mu, logvar):
recon_loss = nn.MSELoss()(x_hat, x)
kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
return recon_loss + 0.001 * kl_lossTraining stabilizes around 30 epochs for small datasets.
Once trained, we can sample new latent vectors and decode them:
vae.eval()
with torch.no_grad():
z = torch.randn(500, latent_dim)
synthetic = vae.decoder(z).numpy()The result: a completely new synthetic dataset with similar statistical properties but zero overlap with the original rows.
A neural network can dream, but can it dream realistically?
| Feature | Real vs Synthetic (JSD ↓) |
|---|---|
| Age | 0.11 |
| Income | 0.18 |
| Score | 0.09 |
| Visits | 0.22 |
→ The model captures most continuous distributions, though skewed variables can cause drift.
Preserves feature dependencies but introduces some nonlinear noise.
In latent 2D space, real and synthetic points overlap significantly, showing strong global similarity.
Synthetic data follows the same trendlines, though boundaries are softer, representing VAE’s stochastic nature.
| Aspect | Gaussian Copula | Variational Autoencoder |
|---|---|---|
| Type | Statistical model | Deep generative model |
| Dependency Modeling | Linear correlations | Nonlinear feature relationships |
| Scalability | High for small data | Scales with GPU & data size |
| Output Variability | Deterministic | Stochastic |
| Use Case | Privacy-preserving analytics | Simulation and creative synthesis |
→ The Copula wins in fidelity; the VAE wins in creativity.
In essence:
- The encoder compresses complex columns into a latent vector space.
- The latent space acts as a mental canvas.
- The decoder reconstructs samples, adding controlled noise.
Each point in this space represents a possible world consistent with your data.
- Data Privacy, Share realistic but anonymized datasets.
- Data Augmentation, Enrich rare classes in imbalanced datasets.
- Simulation, Model potential outcomes or synthetic populations.
- Synthetic Training Data, Train AI without risking exposure.
- Categorical explosion due to one-hot encoding.
- Poor performance with high-cardinality categorical variables.
- Sensitive to normalization and scaling.
- Hard to enforce semantic constraints (age > 0).
- Integrating CTGAN or TabDDPM for mixed-type handling.
- Adding differential privacy to control information leakage.
- Exploring latent disentanglement for interpretability.
- Hybrid approaches (Copula-VAE) for best of both worlds.
“A Variational Autoencoder doesn’t memorize, it generalizes.”
In teaching neural networks to imagine tables, we’re giving them a form of statistical creativity.
The VAE learns what makes a dataset plausible, not exact, bridging the gap between data privacy and AI imagination.
The next frontier: combining deep learning’s creativity with statistical precision to build the future of ethical, generative data science.