Teaching Neural Networks to Imagine Tables: Inside VAE for Tabular Data

Introduction: Can a Neural Network Dream in Tables?

When we think of generative AI, images, text, and audio come to mind. But what if neural networks could imagine structured data tables, the kind analysts and data scientists use every day?

While generating cat images or realistic human faces is by now well-trodden territory, tabular data presents a unique challenge. Each column has its own meaning, scale, and statistical distribution. Relationships can be linear or nonlinear, categorical or numeric, and they often carry business logic (income should not be negative).

This article explores how a Variational Autoencoder (VAE) learns to “dream” in structured data: capturing its patterns, correlations, and variability, and generating new, realistic samples that feel authentic but reveal no real individuals.


Why Tabular Data is Hard for Generative Models

Unlike images (which are spatially coherent) or text (which is sequential), tables are heterogeneous and discontinuous.

| Challenge | Description |
| --- | --- |
| Feature independence | Each column represents a different statistical distribution. |
| Mixed data types | Categorical + numeric + ordinal variables require special handling. |
| Non-spatial structure | There is no concept of "neighboring pixels" or sequential context. |
| Correlations | Real-world dependencies between columns (higher income → higher spending) must be preserved. |

The lack of natural geometry makes tabular synthesis one of the most complex tasks in generative modeling. That’s where VAEs offer a compelling solution.


Variational Autoencoders: The Creative Mathematician

A Variational Autoencoder (VAE) is a generative neural network architecture that learns to compress data into a latent representation and reconstruct it, with a twist: instead of learning a deterministic compression, it learns a probability distribution over latent features.

Encoder → Latent Space → Decoder

X → [Encoder] → μ, σ → z ~ N(μ, σ²) → [Decoder] → X̂
  1. The encoder takes input data X and outputs two vectors, μ (the mean) and σ (the standard deviation, in practice predicted as log σ² for numerical stability), describing the distribution of the latent variable z.
  2. The latent space samples z using the reparameterization trick:

     z = μ + σ ⊙ ε,  ε ~ N(0, I)

     which moves the randomness into ε so gradients can flow through μ and σ.
  3. The decoder reconstructs the input X̂ from z.

Loss Function: Balancing Accuracy and Creativity

L(x, x̂) = ‖x − x̂‖² + D_KL( q(z | x) ‖ N(0, I) )
  • The first term ensures the output looks like the input.
  • The second term ensures the latent space follows a smooth Gaussian prior, so we can sample from it later.

This balance allows VAEs to generate new samples that look statistically consistent but not identical to real ones.


Implementing a VAE for Tabular Data

Step 1: Prepare the Data

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.read_csv("data/real_data.csv")
num_cols = ["age", "income", "score"]
cat_cols = ["segment", "region"]

# Standardize numeric columns; one-hot encode categoricals
scaler = StandardScaler()
encoder = OneHotEncoder(sparse_output=False)  # sparse_output requires scikit-learn >= 1.2

X_num = scaler.fit_transform(df[num_cols])
X_cat = encoder.fit_transform(df[cat_cols])
X = np.concatenate([X_num, X_cat], axis=1)  # numeric block first, then one-hot block

Step 2: Define the VAE Model

import torch
from torch import nn

class VAE(nn.Module):
    def __init__(self, input_dim, latent_dim=8, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU()
        )
        # Two heads: mean and log-variance of the latent Gaussian
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim)
        )

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps keeps the sampling step differentiable
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar

Step 3: Train the Model

def vae_loss(x, x_hat, mu, logvar):
    recon_loss = nn.MSELoss()(x_hat, x)
    # Analytic KL divergence between N(mu, sigma^2) and the N(0, 1) prior
    kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # A small beta weight keeps the KL term from overwhelming reconstruction
    return recon_loss + 0.001 * kl_loss

Training stabilizes around 30 epochs for small datasets.
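
A minimal training loop, sketched under the assumption of the X matrix from Step 1 (the Adam optimizer, learning rate, and batch size are illustrative choices, not values taken from the project):

import torch
from torch.utils.data import DataLoader, TensorDataset

X_tensor = torch.tensor(X, dtype=torch.float32)
loader = DataLoader(TensorDataset(X_tensor), batch_size=64, shuffle=True)

vae = VAE(input_dim=X_tensor.shape[1], latent_dim=8)
optimizer = torch.optim.Adam(vae.parameters(), lr=1e-3)

vae.train()
for epoch in range(30):
    for (batch,) in loader:
        optimizer.zero_grad()
        x_hat, mu, logvar = vae(batch)
        loss = vae_loss(batch, x_hat, mu, logvar)
        loss.backward()
        optimizer.step()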


Sampling Synthetic Data

Once trained, we can sample new latent vectors and decode them:

vae.eval()
with torch.no_grad():
    z = torch.randn(500, 8)  # 8 must match the latent_dim the model was trained with
    synthetic = vae.decoder(z).numpy()

The result: a brand-new synthetic dataset with similar statistical properties, but no row-for-row correspondence with the original data.
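
The decoder output still lives in the preprocessed space, so to get a readable table the numeric block must be un-scaled and the one-hot block mapped back to labels. A sketch reusing the scaler and encoder fitted in Step 1 (OneHotEncoder.inverse_transform effectively takes an argmax over each categorical block, so it tolerates the decoder's continuous outputs):

n_num = len(num_cols)  # numeric columns come first in X
syn_num = scaler.inverse_transform(synthetic[:, :n_num])
syn_cat = encoder.inverse_transform(synthetic[:, n_num:])

synthetic_df = pd.DataFrame(syn_num, columns=num_cols)
synthetic_df[cat_cols] = syn_cat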


Evaluating the Imagination

A neural network can dream, but can it dream realistically?

Distribution Overlap

| Feature | Real vs. Synthetic (JSD, lower is better) |
| --- | --- |
| Age | 0.11 |
| Income | 0.18 |
| Score | 0.09 |
| Visits | 0.22 |

→ The model captures most continuous distributions, though skewed variables can cause drift.
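
Scores like these can be reproduced by histogramming each real and synthetic column on shared bins and comparing the two distributions. A sketch using scipy (whose jensenshannon returns the distance, i.e. the square root of the divergence):

import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd(real_col, syn_col, bins=30):
    # Shared bin edges make the two histograms directly comparable
    edges = np.histogram_bin_edges(np.concatenate([real_col, syn_col]), bins=bins)
    p, _ = np.histogram(real_col, bins=edges)
    q, _ = np.histogram(syn_col, bins=edges)
    return jensenshannon(p, q, base=2) ** 2  # squared distance = divergence, in [0, 1]

jsd(df["age"].to_numpy(), synthetic_df["age"].to_numpy())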

Correlation Heatmap

The synthetic data preserves the main feature dependencies, though some nonlinear relationships pick up extra noise.
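
A quick way to check this is to plot the two correlation matrices side by side (seaborn assumed for plotting; df and synthetic_df come from the earlier steps):

import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.heatmap(df[num_cols].corr(), ax=axes[0], annot=True, vmin=-1, vmax=1)
axes[0].set_title("Real")
sns.heatmap(synthetic_df[num_cols].corr(), ax=axes[1], annot=True, vmin=-1, vmax=1)
axes[1].set_title("Synthetic")
plt.tight_layout()
plt.show()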

PCA Projection

In a 2D PCA projection, real and synthetic points overlap substantially, showing strong global similarity.
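
This check is easy to reproduce: fit PCA on the real preprocessed matrix and project both datasets into the same plane (X and synthetic as defined above):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2).fit(X)   # fit on real data only
real_2d = pca.transform(X)
syn_2d = pca.transform(synthetic)  # project synthetic points into the same plane

plt.scatter(real_2d[:, 0], real_2d[:, 1], s=8, alpha=0.4, label="real")
plt.scatter(syn_2d[:, 0], syn_2d[:, 1], s=8, alpha=0.4, label="synthetic")
plt.legend()
plt.show()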

Pairplot Comparison

Synthetic data follows the same trend lines, though the boundaries are softer, reflecting the VAE's stochastic nature.
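
One way to produce such a comparison is to stack real and synthetic rows with a source label (seaborn assumed):

import pandas as pd
import seaborn as sns

combined = pd.concat([
    df[num_cols].assign(source="real"),
    synthetic_df[num_cols].assign(source="synthetic"),
], ignore_index=True)
sns.pairplot(combined, hue="source", plot_kws={"alpha": 0.4})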


Comparing the VAE to the Gaussian Copula

| Aspect | Gaussian Copula | Variational Autoencoder |
| --- | --- | --- |
| Type | Statistical model | Deep generative model |
| Dependency modeling | Linear correlations | Nonlinear feature relationships |
| Scalability | High for small data | Scales with GPU & data size |
| Output variability | Deterministic | Stochastic |
| Use case | Privacy-preserving analytics | Simulation and creative synthesis |

→ The Copula wins in fidelity; the VAE wins in creativity.


Visual Intuition: The Imagination of a Neural Network

In essence:

  • The encoder compresses complex columns into a latent vector space.
  • The latent space acts as a mental canvas.
  • The decoder reconstructs samples, adding controlled noise.

Each point in this space represents a possible world consistent with your data.


Applications

  1. Data Privacy: Share realistic but anonymized datasets.
  2. Data Augmentation: Enrich rare classes in imbalanced datasets.
  3. Simulation: Model potential outcomes or synthetic populations.
  4. Synthetic Training Data: Train AI without risking exposure.

Limitations

  • Categorical explosion due to one-hot encoding.
  • Poor performance with high-cardinality categorical variables.
  • Sensitive to normalization and scaling.
  • Hard to enforce semantic constraints such as age > 0 (see the clipping sketch below).
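
Hard constraints usually have to be enforced after generation. A pragmatic, if blunt, fix is post-hoc clipping of the decoded numeric columns (bounds are illustrative; this is not a substitute for constraint-aware modeling):

# Clip decoded values into valid business ranges (illustrative bounds)
synthetic_df["age"] = synthetic_df["age"].clip(lower=0)
synthetic_df["income"] = synthetic_df["income"].clip(lower=0)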

Future Directions

  • Integrating CTGAN or TabDDPM for mixed-type handling.
  • Adding differential privacy to control information leakage.
  • Exploring latent disentanglement for interpretability.
  • Hybrid approaches (Copula-VAE) for best of both worlds.

Conclusion

“A Variational Autoencoder doesn’t memorize, it generalizes.”

In teaching neural networks to imagine tables, we’re giving them a form of statistical creativity.
The VAE learns what makes a dataset plausible rather than exact, bridging the gap between data privacy and AI imagination.

The next frontier: combining deep learning’s creativity with statistical precision to build the future of ethical, generative data science.
