Synthetic Data Generation for AI Training: Complete Python Implementation Guide 2026
Data scarcity is killing AI projects. Organizations spend $50,000+ to collect edge cases manually—then wait 6-12 months for training to begin. Meanwhile, 60% of enterprises are already using synthetic data to cut costs by up to 99% while accelerating model deployment by 10x. linvelo
This isn't theoretical. ING Belgium generated 10,000 synthetic payment records in 2 minutes using SDV, achieving 100x test coverage in one-tenth the time. Deloitte synthesized entire databases with 8M+ records while maintaining GDPR compliance. By 2026, synthetic data is projected to comprise 60% of all AI training data—a structural shift reshaping how enterprises build AI systems. datacebo
This guide gives you the complete framework: production-ready Python implementations, privacy-preserving techniques with differential privacy, evaluation metrics that separate real quality from hype, and a decision framework to choose the right tools for your specific requirements.
We'll cover what actually matters for enterprise-grade systems: not just code, but compliance, privacy-utility trade-offs, and real-world deployment patterns that top-performing teams use.
The Problem: Why Synthetic Data Matters Now
Traditional data collection is broken for AI development. Organizations face three interconnected constraints:
Data Scarcity. Real datasets lack the edge cases and stress scenarios needed to build robust models. A fraud detection system trained on 500 legitimate transactions will fail catastrophically on the 501st unusual pattern. Healthcare teams can't wait years to accumulate rare disease cases. Financial institutions need to test payment systems under conditions that haven't occurred yet.
Privacy Exposure. Training AI on sensitive data (customer records, patient histories, financial transactions) creates regulatory liability. GDPR fines can reach €20M or 4% of global annual turnover, whichever is higher. Healthcare data breaches cost about $11M per incident on average, up 53% since 2020. Pseudonymization isn't enough; even "anonymized" datasets risk re-identification when cross-referenced with external sources. scholarship.law.ufl
Cost Overruns. Manual data labeling costs $50,000 per edge case. Storage and compliance infrastructure multiply the pain. Traditional approaches bottleneck at the human annotation stage: you can't label data faster than humans can review it.
Synthetic data solves all three simultaneously. By generating artificial data that preserves statistical patterns without containing real individuals, organizations can:
- Scale training data without waiting for real-world events to occur (60% cost reduction vs. collection) keymakr
- Compress timelines from 12 months to weeks (ING's 10K payments in 2 minutes) datacebo
- Reduce privacy risk from GDPR/HIPAA exposure while maintaining model utility (empirical quality scores stay at 85%+ with differential privacy enabled) mostly
- Stress-test systems on synthetic failure scenarios before production deployment invisibletech
By 2026, around 60% of all AI training data is projected to be artificially generated, according to enterprise forecasts backed by Microsoft, Google, and OpenAI. linvelo
Understanding the Technology: Methods & Architecture
Synthetic data generation combines three core techniques: statistical modeling, generative machine learning, and privacy-enhancing technologies.
Generation Methods
Generative Adversarial Networks (GANs) pit two neural networks against each other. A "generator" creates synthetic records while a "discriminator" tries to distinguish fake from real. The competition forces the generator to produce increasingly realistic data. GANs excel at capturing complex relationships but are famously difficult to train—mode collapse (getting stuck on limited patterns) and vanishing gradients slow convergence. datasciencecampus.ons.gov
Variational Autoencoders (VAEs) compress real data into a latent space, then decode random samples to generate new records. TVAEs (tabular VAEs) outperform CTGANs on KL Divergence in some benchmarks, achieving better distribution matching. The trade-off: VAEs are more stable than GANs but sometimes less flexible. thesai
Statistical Methods use copulas, Bayesian networks, or marginal distributions to capture relationships without deep learning. Gaussian Copulas are fast and interpretable; Bayesian Networks preserve conditional dependencies. The limitation: they struggle with high-dimensional complexity and sequential data. arxiv
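To make the copula idea concrete, here is a minimal sketch of a Gaussian copula on two numeric columns, using only NumPy and the standard library. It is not SDV's implementation: the rank-based fit, the empirical-quantile inversion, and the toy data are all assumptions chosen for brevity.

```python
import math
from statistics import NormalDist

import numpy as np


def fit_gaussian_copula(data: np.ndarray) -> np.ndarray:
    """Estimate the copula correlation matrix from an (n_rows, n_cols) array."""
    n_rows = data.shape[0]
    # Rank each column, then scale ranks into (0, 1) so no value hits 0 or 1.
    ranks = data.argsort(axis=0).argsort(axis=0) + 1
    uniforms = ranks / (n_rows + 1)
    # Map the uniform scores to standard-normal scores.
    to_normal = np.vectorize(NormalDist().inv_cdf)
    normal_scores = to_normal(uniforms)
    return np.corrcoef(normal_scores, rowvar=False)


def sample_gaussian_copula(data, corr, n_samples, rng):
    """Draw synthetic rows whose marginals follow the empirical quantiles of `data`."""
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_samples)
    # The standard-normal CDF turns correlated normals back into uniforms.
    normal_cdf = np.vectorize(lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0))))
    u = normal_cdf(z)
    # Invert each marginal with the empirical quantile function of the real column.
    columns = [np.quantile(data[:, j], u[:, j]) for j in range(data.shape[1])]
    return np.column_stack(columns)


# Toy "real" data (hypothetical): a skewed column and a column that depends on it.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=2000)
y = 3.0 * x + rng.normal(scale=1.0, size=2000)
real = np.column_stack([x, y])

corr = fit_gaussian_copula(real)
synthetic = sample_gaussian_copula(real, corr, n_samples=2000, rng=rng)

real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real corr={real_corr:.2f}, synthetic corr={synth_corr:.2f}")
```

The sketch also shows the limitation noted above: a single correlation matrix captures pairwise, monotonic dependence only, so higher-order or sequential structure is lost.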
Large Language Models (LLMs) can generate text and structured data with fine-tuning. Using DP-SGD (differentially private stochastic gradient descent) during fine-tuning ensures privacy guarantees—Google's recent work shows this produces synthetic text with mathematical privacy proofs. research
Key Architecture Decision: Single vs. Hybrid Approaches
Research from Stanford and MIT shows the optimal approach isn't "all synthetic." Instead:
- Start with high-quality human data ("gold set") to define what "good" looks like invisibletech
- Generate synthetic data for edge cases the original dataset barely covers
- Mix in a controlled ratio (typically 20-30% synthetic, 70-80% real) invisibletech
- Validate on production workflows, not abstract benchmarks
This hybrid approach outperforms both pure real data (limited edge cases) and pure synthetic (potential distribution shift).
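The controlled mixing step can be sketched as a small pandas helper. The function name, the `source` column, and the toy frames are hypothetical; the only idea taken from the text is that synthetic rows should make up a fixed share of the final training set.

```python
import pandas as pd


def mix_real_and_synthetic(real, synthetic, synthetic_fraction=0.25, seed=0):
    """Return a shuffled training set in which roughly `synthetic_fraction` of rows are synthetic."""
    if not 0 <= synthetic_fraction < 1:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    # Solve n_synth / (n_real + n_synth) = synthetic_fraction for n_synth.
    n_synth = round(len(real) * synthetic_fraction / (1 - synthetic_fraction))
    n_synth = min(n_synth, len(synthetic))
    sampled = synthetic.sample(n=n_synth, random_state=seed)
    mixed = pd.concat(
        [real.assign(source="real"), sampled.assign(source="synthetic")],
        ignore_index=True,
    )
    # Shuffle so that real and synthetic rows are interleaved during training.
    return mixed.sample(frac=1.0, random_state=seed).reset_index(drop=True)


real_df = pd.DataFrame({"amount": range(80)})
synth_df = pd.DataFrame({"amount": range(1000, 1100)})
mixed = mix_real_and_synthetic(real_df, synth_df, synthetic_fraction=0.2)
print(mixed["source"].value_counts().to_dict())
```

Keeping the `source` column makes it easy to hold synthetic rows out of validation, which should run on real data only.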
Python Implementation: From Data to Production
Here's the complete workflow for generating, evaluating, and deploying synthetic data. We'll implement three methods: simple Faker-based generation, GAN-based synthesis using SDV, and differentially private generation using MOSTLY.AI.
Method 1: Faker for Simple Synthetic Data
Best for: Quick test data, non-sensitive attributes, rapid prototyping.
from faker import Faker
import pandas as pd
import random

fake = Faker()

# Generate synthetic customer records
def generate_fake_customers(num_records=1000):
    customers = []
    for _ in range(num_records):
        customers.append({
            'customer_id': fake.uuid4(),
            'name': fake.name(),
            'email': fake.email(),
            'signup_date': fake.date_between(start_date='-2y'),
            'country': fake.country(),
            'account_value': round(random.uniform(100, 50000), 2),
            'churn_risk': random.choice([0, 1])  # Binary classification
        })
    return pd.DataFrame(customers)

# Generate and save
synthetic_customers = generate_fake_customers(1000)
synthetic_customers.to_csv('synthetic_customers.csv', index=False)
print(f"Generated {len(synthetic_customers)} synthetic customer records")
Performance: Generates 3,000 records in <2 seconds. Suitable for dev/test environments but lacks statistical correlation to real data—use only for non-ML applications.
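Because `churn_risk` in the Faker example is drawn independently of every other field, a model has nothing to learn from it. If test data needs at least some learnable signal, one option is to derive the label from another column. The sketch below uses only the standard library; the logistic rule and its constants are hypothetical and not estimated from any real data.

```python
import math
import random


def generate_customers_with_signal(num_records=1000, seed=42):
    """Generate rows in which churn probability falls as account value rises."""
    rng = random.Random(seed)  # Seeded so that test fixtures are reproducible
    rows = []
    for _ in range(num_records):
        account_value = round(rng.uniform(100, 50000), 2)
        # Hypothetical rule: 50% churn probability at 10,000, dropping towards 0 above it.
        churn_probability = 1.0 / (1.0 + math.exp((account_value - 10000) / 5000))
        rows.append({
            "account_value": account_value,
            "churn_risk": int(rng.random() < churn_probability),
        })
    return rows


rows = generate_customers_with_signal()
low_value = [r["churn_risk"] for r in rows if r["account_value"] < 10000]
high_value = [r["churn_risk"] for r in rows if r["account_value"] >= 30000]
print(f"churn rate, low-value accounts:  {sum(low_value) / len(low_value):.2f}")
print(f"churn rate, high-value accounts: {sum(high_value) / len(high_value):.2f}")
```

This is still hand-written test data. It encodes the author's assumption about the relationship, not a pattern learned from real customers.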
Method 2: Synthetic Data Vault (SDV) with CTGAN
Best for: Production-grade tabular data, mixed data types, published benchmarks.
SDV hit 10M downloads in 2025, making it one of the most accessible platforms for researchers and enterprises. CTGAN (Conditional Tabular GAN) handles both continuous and categorical features simultaneously. datacebo
import pandas as pd
from sdv.evaluation.single_table import evaluate_quality
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer, GaussianCopulaSynthesizer

# Load your real data
real_data = pd.read_csv('customer_transactions.csv')

# Define metadata (SDV needs to understand your data types)
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Option A: Fast statistical method (Gaussian Copula)
# Good for: Baseline quality, speed
synth_gaussian = GaussianCopulaSynthesizer(metadata)
synth_gaussian.fit(real_data)
synthetic_data_copula = synth_gaussian.sample(num_rows=5000)

# Option B: Advanced deep learning (CTGAN)
# Good for: Maximum fidelity, complex relationships
synth_ctgan = CTGANSynthesizer(metadata, epochs=300)
synth_ctgan.fit(real_data)
synthetic_data_ctgan = synth_ctgan.sample(num_rows=5000)

# Evaluate quality
report = evaluate_quality(real_data, synthetic_data_ctgan, metadata)
print(f"Overall Quality Score: {report.get_score():.2%}")

# get_properties() returns one score per property as a DataFrame;
# get_details() returns per-column breakdowns rather than a single number.
property_scores = report.get_properties().set_index('Property')['Score']
print(f"Column Shapes: {property_scores['Column Shapes']:.2%}")
print(f"Column Pair Trends: {property_scores['Column Pair Trends']:.2%}")
Benchmarks (ING Belgium case study): datacebo
- Training time: <5 minutes on consumer hardware (no GPU needed)
- Sampling 10K records: 2 minutes
- Coverage improvement: 100x with 1/10th the manual effort
- Quality: 89-91% similarity on average datasets pypi
Accuracy Trade-offs: SDV's accuracy varies by model type and data shape. In MOSTLY.AI's sequential-data benchmark, SDV scored 52.7% overall accuracy (35.4% on bivariate relationships), while MOSTLY.AI scored 97.8%. For research, SDV's 52.7% may be acceptable; for production risk systems (fraud, credit decisions), the accuracy gap matters. Because the comparison was run by a vendor, rerun it on your own data before committing. mostly
Method 3: Differentially Private Synthesis with Gradient Clipping
Best for: Regulated industries (healthcare, finance), GDPR/HIPAA compliance, provable privacy.
Differential privacy adds calibrated random noise during training so that no single individual's data significantly influences the model. This bounds what an attacker can learn about any one person and makes re-identification attacks much harder.
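The "gradient clipping" in this method's name refers to the core step of DP-SGD: clip each example's gradient, then add Gaussian noise before averaging. The NumPy sketch below shows a single update under those assumptions. The function name and toy gradients are hypothetical, and the sketch does not compute the resulting epsilon; that is the job of a privacy accountant in a library such as Opacus.

```python
import numpy as np


def dp_sgd_step(weights, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One DP-SGD update: clip per-example gradients, add Gaussian noise, average, step."""
    # 1. Clip: rescale each example's gradient so its L2 norm is at most clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    # 2. Noise: add Gaussian noise with std = noise_multiplier * clip_norm to the sum.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=weights.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(per_example_grads)
    # 3. Step: ordinary gradient descent on the noisy average.
    return weights - lr * noisy_mean


rng = np.random.default_rng(0)
weights = np.zeros(3)
# Toy per-example gradients: the third row is an outlier that clipping tames.
grads = np.array([[3.0, 4.0, 0.0], [0.1, 0.0, 0.0], [100.0, 0.0, 0.0]])
new_weights = dp_sgd_step(weights, grads, clip_norm=1.0, noise_multiplier=1.1, lr=0.1, rng=rng)
print(new_weights)
```

Clipping caps how far any single record can move the model, and the noise hides what remains of that record's contribution. Together they produce the privacy guarantee, at the cost of slower and noisier training.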
# Using the MOSTLY.AI SDK (open source as of 2025)
# The config keys below follow the SDK's training config as documented at the
# time of writing; verify them against the version you install.
from mostlyai.sdk import MostlyAI
import pandas as pd

# Load sensitive data (e.g., patient records)
sensitive_data = pd.read_csv('patient_records.csv')

# Run the SDK in local mode so the sensitive data stays on your machine
mostly = MostlyAI(local=True)

# Train a generator with differential privacy enabled
generator = mostly.train(config={
    'name': 'patient-records-dp',
    'tables': [{
        'name': 'patients',
        'data': sensitive_data,
        'tabular_model_configuration': {
            # Privacy budget (lower = more privacy, less utility)
            'differential_privacy': {'max_epsilon': 1.0},
        },
    }],
})

# Generate synthetic data from the DP-trained generator
synthetic_dataset = mostly.generate(generator, size=5000)
synthetic_data = synthetic_dataset.data()

# Privacy metrics: singling-out, linkability, and inference risk are attack-based
# estimates. Compute them with a separate evaluator (for example, the Anonymeter
# library) by comparing synthetic_data against the original and a holdout set.
Compliance Impact:
- Training time: 12 minutes (vs. 3 minutes without DP) mostly
- Utility trade-off: Minimal (empirical scores remain 85%+) mostly
- Privacy guarantee: Mathematically proven, accepted by regulators gdprlocal
- Epsilon tracking: Prevents budget over-spend
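GDPR/HIPAA Alignment: Differential privacy supports GDPR Article 25 ("data protection by design and by default"). It goes beyond pseudonymization by placing a mathematical bound on how much any one individual can influence the output. That moves synthetic data from "risky with caveats" toward "compliant by design," provided a privacy evaluation and DPIA are also in place. gdprlocal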
Evaluation: Separating Quality from Hype
The privacy-utility trade-off is real. You can't maximize both at once: stronger privacy guarantees generally cost some utility, and pushing utility to the maximum reintroduces privacy risk. The FEST framework provides a systematic way to assess where a dataset sits on that curve. arxiv
Utility Metrics (Does it work for ML?)
| Metric | What It Measures | Acceptable Threshold |
|---|---|---|
| Column Shapes | Do univariate distributions match? | >85% |
| Column Pair Trends | Are correlations preserved? | >80% |
| KL Divergence | Statistical distance to real data | <0.05 (lower is better) |
| Downstream ML Utility | Does a model trained on synthetic perform on real data? | >90% accuracy match |
| Diversity | Are all subpopulations represented? | No coverage gaps |
Real benchmark (CTGAN vs. TVAE): TVAE achieves lower KL divergence (a closer distribution match), but CTGAN is preferred for privacy because it's agnostic to actual data values. Which one wins depends on the use case. thesai
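For a single numeric column, the KL divergence in the utility table can be estimated from histograms over shared bin edges. The sketch below is one common way to do it. The bin count, the smoothing constant, and the toy data are assumptions, and the result depends on the binning, so compare scores only when they were computed with the same settings.

```python
import numpy as np


def kl_divergence(real, synthetic, bins=20, eps=1e-9):
    """Estimate KL(real || synthetic) for one numeric column using shared histogram bins."""
    low = min(real.min(), synthetic.min())
    high = max(real.max(), synthetic.max())
    edges = np.linspace(low, high, bins + 1)
    p, _ = np.histogram(real, bins=edges)
    q, _ = np.histogram(synthetic, bins=edges)
    # Smooth with a small constant so empty bins do not produce log(0) or division by zero.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))


rng = np.random.default_rng(0)
real_col = rng.normal(0.0, 1.0, size=5000)
close_col = rng.normal(0.0, 1.0, size=5000)    # Same distribution as the real column
shifted_col = rng.normal(1.5, 1.0, size=5000)  # Mean shifted away from the real column

kl_close = kl_divergence(real_col, close_col)
kl_shifted = kl_divergence(real_col, shifted_col)
print(f"KL (matching distribution): {kl_close:.4f}")
print(f"KL (shifted distribution):  {kl_shifted:.4f}")
```

Lower is better: a value near zero means the synthetic column's histogram is close to the real one.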
Privacy Metrics (Are individuals protected?)
| Risk Category | Attack Type | Acceptable Level | How to Measure |
|---|---|---|---|
| Singling Out | Isolate a unique individual | <0.01 (1%) | Proportion of unique records |
| Linkage | Cross-reference with external data | <0.05 (5%) | Re-identification via other datasets |
| Inference | Guess hidden attribute of individual | <0.35 | Predictability of sensitive columns |
| Membership | Determine if someone was in training set | <0.55 | Model confidence on in-distribution vs. out-of-distribution |
| Exact Match | Find real records in synthetic data | 0 (must be zero) | Count of duplicates |
Research example (DP-CTGAN for IoT intrusion detection): pmc.ncbi.nlm.nih
- Singling-out risk: 0.0069 (safe)
- Linkability: 0.001 (safe)
- Inference: 0.35 (balanced—utility preserved)
- KS test: 0.80 (high fidelity)
This is achievable with DP-CTGAN using smooth sensitivity noise injection + quantile matching + dynamic KS adjustment—the algorithm in the research paper. pmc.ncbi.nlm.nih
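Two of the checks in the privacy table are simple enough to run by hand before reaching for a full framework: counting exact matches and measuring how many synthetic rows have a unique quasi-identifier combination. The pandas sketch below does both. The function names and toy frames are hypothetical, and the uniqueness share is only a rough proxy, not the attack-based risk estimate reported in the research above.

```python
import pandas as pd


def exact_match_count(real, synthetic):
    """Count synthetic rows that are identical to some real row across all shared columns."""
    columns = list(real.columns)
    real_rows = set(real[columns].itertuples(index=False, name=None))
    synthetic_rows = synthetic[columns].itertuples(index=False, name=None)
    return sum(row in real_rows for row in synthetic_rows)


def unique_quasi_id_share(synthetic, quasi_identifiers):
    """Share of synthetic rows whose quasi-identifier combination occurs exactly once."""
    duplicated = synthetic.duplicated(subset=quasi_identifiers, keep=False)
    return float((~duplicated).mean())


real_df = pd.DataFrame({
    "zip_code": [1000, 2000, 3000],
    "age": [30, 40, 50],
    "income": [10, 20, 30],
})
synthetic_df = pd.DataFrame({
    "zip_code": [1000, 2000, 2000, 4000],
    "age": [30, 41, 41, 60],
    "income": [10, 21, 22, 31],
})

matches = exact_match_count(real_df, synthetic_df)
unique_share = unique_quasi_id_share(synthetic_df, ["zip_code", "age"])
print(f"Exact matches: {matches}")
print(f"Unique quasi-ID share: {unique_share:.2f}")
```

Any exact match is a release blocker under the table's threshold. A high uniqueness share is a prompt to run a proper singling-out evaluation rather than a verdict in itself.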
Evaluation Workflow in Code
# NOTE: the module, class, and method names below are illustrative.
# Check the FEST repository for the actual import path and API before running.
from fest_evaluation import FEST  # Open-source framework
import pandas as pd

# Load real and synthetic data
real_data = pd.read_csv('real_transactions.csv')
synthetic_data = pd.read_csv('synthetic_transactions.csv')

# Initialize FEST framework
fest = FEST()

# Evaluate across all dimensions
results = fest.evaluate(
    real_data=real_data,
    synthetic_data=synthetic_data,
    # Quasi-ID columns; direct identifiers such as customer_id should be dropped first
    quasi_identifiers=['zip_code', 'dob'],
    sensitive_columns=['income', 'medical_condition']
)

# Get holistic assessment
print(f"Fidelity Score: {results['fidelity']:.2%}")      # How real does it look?
print(f"Utility Score: {results['utility']:.2%}")        # How useful for ML?
print(f"Privacy Score: {results['privacy']:.2%}")        # How safe?
print(f"Recommended Use: {results['recommended_use']}")  # Production? Testing? Limited?

# Visualize privacy-utility trade-off curve
results.plot_privacy_utility_curve()
Decision Rule: For production use, all three must pass. If privacy ≥85%, utility ≥80%, fidelity ≥90%, the dataset is safe for external sharing. Below these thresholds, use internally only or combine with real data.
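The decision rule is easy to encode as a release gate in a pipeline. The sketch below uses the thresholds from the rule above; the function name and the shape of the `scores` dictionary are assumptions.

```python
# Minimum scores from the decision rule above
THRESHOLDS = {"privacy": 0.85, "utility": 0.80, "fidelity": 0.90}


def recommend_use(scores):
    """Return a usage recommendation based on privacy, utility, and fidelity scores in [0, 1]."""
    failed = [name for name, minimum in THRESHOLDS.items() if scores[name] < minimum]
    if not failed:
        return "external sharing"
    return "internal use only (below threshold: " + ", ".join(failed) + ")"


print(recommend_use({"privacy": 0.90, "utility": 0.85, "fidelity": 0.95}))
print(recommend_use({"privacy": 0.80, "utility": 0.85, "fidelity": 0.95}))
```

Returning the names of the failed dimensions makes the gate's output usable in an audit log, not just as a pass/fail flag.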
Enterprise Best Practices: Deployment Architecture
Real enterprises don't deploy pure synthetic data. They build hybrid pipelines:
Pattern 1: Core-Expansion Architecture
Real Data (100K records, curated)
↓
Identify Coverage Gaps (edge cases, rare events)
↓
Generate Synthetic Data (10K records for gaps)
↓
Mix (10:1 real-to-synthetic ratio)
↓
Train Model
↓
Validate on Real-World Production Data
↓
Deploy with Monitoring
Example (Payment fraud): ING Belgium trained on 5K real SEPA payments + 10K synthetic variants covering stress scenarios, achieving 100x test coverage without privacy exposure. datacebo
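The "Identify Coverage Gaps" step in this pattern can start as a frequency count over a scenario column. The sketch below flags values that are rare or missing entirely. The field name, scenario labels, and 1% cut-off are hypothetical.

```python
from collections import Counter


def find_coverage_gaps(records, key, min_share=0.01, expected_values=()):
    """Return {value: share} for values of `key` that make up less than `min_share` of records."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    gaps = {}
    # Include values that should exist but never occur in the real data.
    for value in set(counts) | set(expected_values):
        share = counts.get(value, 0) / total
        if share < min_share:
            gaps[value] = share
    return gaps


payments = (
    [{"scenario": "standard"}] * 990
    + [{"scenario": "refund"}] * 8
    + [{"scenario": "chargeback"}] * 2
)
gaps = find_coverage_gaps(
    payments,
    key="scenario",
    min_share=0.01,
    expected_values=["duplicate_submission"],
)
for scenario, share in sorted(gaps.items()):
    print(f"{scenario}: {share:.1%} of real data")
```

The flagged values become the conditions to request from the generator, so synthetic rows go where real coverage is thinnest.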
Pattern 2: Privacy-First with DP Guarantee
Sensitive Real Data (healthcare, financial)
↓
Train Generator with DP-SGD (epsilon=1.0)
↓
Generate Synthetic Data (mathematically de-identified)
↓
Publish/Share Safely (bounded re-identification risk, GDPR assessment documented)
↓
Use for: Model training, BI analytics, external research
↓
Regulatory Audit Trail (DPIA documented, privacy budget tracked)
Impact: Moves from "sensitive data, restricted access" to "safe data, shareable with regulators."
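Tracking the privacy budget in this pattern can be as simple as a ledger that refuses to overspend. The sketch below assumes basic sequential composition, where the epsilons of separate releases trained on the same data add up. The class and the labels are hypothetical, and production accountants use tighter composition bounds.

```python
class PrivacyBudget:
    """Ledger for epsilon spent on one sensitive dataset, using basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.ledger = []  # (label, epsilon) entries, kept for the audit trail

    @property
    def remaining(self):
        return self.total_epsilon - sum(epsilon for _, epsilon in self.ledger)

    def spend(self, label, epsilon):
        """Record a release, or raise if it would exceed the total budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining + 1e-12:
            raise RuntimeError(f"'{label}' needs {epsilon}, only {self.remaining:.3f} left")
        self.ledger.append((label, epsilon))


budget = PrivacyBudget(total_epsilon=1.0)
budget.spend("generator v1", 0.6)
budget.spend("generator v2", 0.3)
print(f"Remaining epsilon: {budget.remaining:.2f}")

try:
    budget.spend("generator v3", 0.5)
except RuntimeError as error:
    print(f"Blocked: {error}")
```

The ledger entries double as the documentation a DPIA needs: what was released, when, and at what privacy cost.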
Pattern 3: Synthetic-for-Augmentation (Rare Events)
Original Dataset (imbalanced, 0.5% positive class)
↓
Oversample Minority Class Synthetically (SMOTE or GAN-based)
↓
Train on Balanced Mix
↓
Validate on Real Imbalanced Test Set
Use case: Rare disease diagnosis, fraud detection, anomaly detection. Keeps the classifier from ignoring the minority class.
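The SMOTE option in this pattern works by interpolating between a minority-class point and one of its nearest minority neighbours. The NumPy sketch below shows that core idea for numeric features only. It is a simplified illustration with a hypothetical function name, not the reference SMOTE implementation from imbalanced-learn.

```python
import numpy as np


def smote_like_oversample(minority, n_new, k=3, rng=None):
    """Create n_new points by interpolating between minority rows and their k nearest neighbours."""
    if rng is None:
        rng = np.random.default_rng()
    n_rows = len(minority)
    if k >= n_rows:
        raise ValueError("k must be smaller than the number of minority rows")
    # Pairwise Euclidean distances between minority rows; ignore self-distances.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a random base row and one of its neighbours for each new point.
    base_idx = rng.integers(0, n_rows, size=n_new)
    neighbour_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    # Place the new point at a random position on the segment between the two.
    gaps = rng.random((n_new, 1))
    return minority[base_idx] + gaps * (minority[neighbour_idx] - minority[base_idx])


rng = np.random.default_rng(0)
minority_rows = rng.normal(loc=5.0, scale=1.0, size=(10, 2))  # Toy minority class
new_rows = smote_like_oversample(minority_rows, n_new=40, k=3, rng=rng)
print(new_rows.shape)
```

Apply the oversampling to the training split only. As the pipeline above states, validation should stay on the real, imbalanced test set.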
Tool Comparison: Choosing Your Stack
| Tool | Use Case | Cost | Setup Time | Privacy | Accuracy | Best For |
|---|---|---|---|---|---|---|
| Faker | Quick test data | Free | 5 min | None | Low | Dev/QA |
| SDV (Gaussian Copula) | Fast baseline | Free | 10 min | None | Medium | Research |
| SDV (CTGAN) | Production research | Free | 30 min | None | Medium-High | Benchmarking |
| MOSTLY.AI | Enterprise DP | Custom | 1 hour | DP built-in | High (97.8%) | Regulated industries |
| Gretel Synthetics | Managed platform | $295-$10K/mo | 2 hours | Optional DP | High | Team/SaaS |
| SmartNoise | DP-first analytics | Free | 30 min | DP only | Medium | Privacy researchers |
| Synthcity | Fairness + privacy | Free | 45 min | Modular | High | Academic/fairness-critical |
For Bangladesh-based AI teams: Start with open-source SDV (free, excellent docs). Scale to MOSTLY.AI if you need DP compliance for enterprise clients. Avoid proprietary platforms unless you have specific vendor lock-in reasons.
The Privacy-Compliance Reality Check
This is the part vendors don't emphasize: synthetic data is not automatically GDPR/HIPAA compliant.
The Legal Standard
GDPR's anonymization test comes from Recital 26: data falls outside the regulation only if individuals cannot be identified by any means reasonably likely to be used. This is a legal standard, not a technical one. Synthetic data fails this test if: gdprlocal
- Re-identification is possible through cross-referencing with external datasets linkedin
- Outliers are preserved (unusual combinations hint at real people) bluegen
- No privacy evaluation was done (regulators expect DPIA documentation) em360tech
What Actually Works
- ✅ Differential privacy during training (mathematical proof of de-identification) github
- ✅ Rigorous re-identification testing (membership inference, attribute inference attacks) bluegen
- ✅ Documented DPIA with privacy budget tracking gdprlocal
- ✅ Governance framework specifying retention, access, audit trails gdprlocal
- ❌ Assuming "synthetic = anonymous" em360tech
- ❌ Skipping privacy evaluation (95% of papers do this) nature
- ❌ Using high-fidelity synthesis without privacy controls (defeats the purpose)
- ❌ Transferring synthetic data internationally without contractual safeguards gdprlocal
Practical rule: For external sharing, synthetic data must either (1) use differential privacy with epsilon ≤ 1.0 or (2) pass formal re-identification testing by independent auditors. gdprlocal
Real-World Performance Benchmarks
Benchmark 1: ING Belgium (Payment Processing)
- Real data: 5K historical SEPA payments
- Synthetic generation: 10K payments in 2 minutes
- Test coverage: 100x improvement vs. manual test cases
- Time saved: 1/10th the manual effort
- Deployment: Production SEPA processing for millions of users datacebo
Benchmark 2: MOSTLY.AI vs. SDV (Sequential Data)
| Metric | SDV | MOSTLY.AI | Winner |
|---|---|---|---|
| Overall Accuracy | 52.7% | 97.8% | MOSTLY.AI (+45.1 pts) |
| Bivariate Analysis | 35.4% | 89.3% | MOSTLY.AI (+53.9 pts) |
| Sequential Coherence | 18.3% | 94.5% | MOSTLY.AI (+76.2 pts) |
| Privacy (DCR Share) | Similar | Similar | Tie |
| Training Speed | Fast | Moderate | SDV |
SDV excels at univariate analysis (71.7%) but struggles with complex relationships. MOSTLY.AI dominates when preserving sequential patterns matters (time-series, financial data). mostly
Benchmark 3: DP-CTGAN for IoT Intrusion Detection
With differential privacy + smart noise injection:
- Singling-out risk: 0.0069 (safe)
- Inference risk: 0.35 (acceptable trade-off)
- KS test score: 0.80 (high utility)
- Training overhead: ~4x slower, but privacy-guaranteed pmc.ncbi.nlm.nih
Getting Started: Your Implementation Roadmap
Phase 1: Proof of Concept (Week 1-2)
- Choose a dataset: Start with 1K-10K records, non-sensitive
- Generate with Faker + SDV: Baseline quality assessment
- Evaluate: Run FEST framework, visualize privacy-utility curve
- Decision: "Should we scale this?"
pip install sdv faker pandas numpy
python synthetic_poc.py # See code snippets above
Phase 2: Production Setup (Week 3-6)
- Pick your tool based on compliance needs (SDV for research, MOSTLY.AI for DP)
- Build evaluation pipeline with FEST + privacy metrics
- Hybrid data strategy: Identify real/synthetic mix ratio for your use case
- Document governance: DPIA, privacy budget, retention policy
Phase 3: Scale & Monitor (Week 7+)
- Integrate into ML pipeline (training data generation as code)
- Monitor privacy-utility metrics in production
- Version synthetic generators like code (track epsilon, model parameters)
- Regulatory audit trail: Maintain logs for compliance audits
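"Version synthetic generators like code" can be made concrete with a small manifest stored next to each trained generator. The sketch below uses only the standard library. The field names and example values are hypothetical, and a real pipeline would hash the actual training file rather than a literal byte string.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class GeneratorManifest:
    """Everything needed to identify and audit one trained synthetic data generator."""
    generator: str
    library_version: str
    epsilon: Optional[float]  # None when trained without differential privacy
    hyperparameters: dict
    training_data_sha256: str

    def fingerprint(self) -> str:
        """Short, stable ID derived from every field in the manifest."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


manifest = GeneratorManifest(
    generator="ctgan",
    library_version="placeholder-version",  # Record the installed library version here
    epsilon=None,
    hyperparameters={"epochs": 300},
    training_data_sha256=hashlib.sha256(b"example training data").hexdigest(),
)
print(json.dumps(asdict(manifest), indent=2))
print(f"Generator fingerprint: {manifest.fingerprint()}")
```

Because the fingerprint changes whenever epsilon, hyperparameters, or training data change, it can tag every synthetic dataset with the exact generator that produced it.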
Common Pitfalls & Solutions
| Problem | Symptom | Solution |
|---|---|---|
| Mode Collapse | Synthetic data repeats same patterns | Reduce learning rate, increase batch size, use TVAE instead of GAN |
| Privacy Leakage | Re-identification attacks succeed | Add differential privacy (epsilon ≤ 1.0), test with membership inference |
| Low Utility | Models trained on synthetic fail on real | Increase fidelity (reduce privacy constraints), use hybrid real+synthetic |
| Skewed Distributions | Rare classes vanish | Oversample minority class synthetically, use conditional sampling |
| Slow Training | CTGAN takes hours | Switch to Gaussian Copula (10x faster), use TVAE instead |
| Overfitting to Real Data | Synthetic = copies of real | Validate exact match score (should be 0), use regularization |
The Business Case: When ROI Justifies Synthetic Data
Synthetic data ROI is proven in specific scenarios:
High-ROI Use Cases
- Regulated industries (healthcare, finance, telecom)
  - Cost: Privacy compliance infrastructure
  - Savings: Breach avoidance ($11M+ each), faster deployment
  - ROI: 5.9-13x average
- Rare event modeling (fraud, anomalies, edge cases)
  - Cost: Synthetic generation ($500 per edge case)
  - vs. Real data collection ($50K per edge case)
  - ROI: 100x on edge case costs linkedin
- Rapid experimentation (A/B testing, ML iteration)
  - Cost: Synthetic data generation (minimal, a matter of hours)
  - vs. Real data access (weeks of approval)
  - ROI: Time-to-market acceleration (6-12 month reduction)
- Multi-team data sharing
  - Cost: Once-only synthetic dataset generation
  - vs. Multiple privacy requests, data silos, approval delays
  - ROI: 10x+ on operational efficiency
Low-ROI Use Cases (Avoid)
- General analytics on public datasets (use real data)
- Non-sensitive use cases (privacy constraints don't apply)
- High-precision applications without validation (synthetic may fail silently)
Looking Ahead: Synthetic Data in 2026
The landscape is shifting rapidly:
- Regulation tightening: EU EDPS now expects synthetic data use cases to prove privacy compliance, not assume it. gdprlocal
- LLM-based generation: Fine-tuning foundation models with DP-SGD for text/document synthesis is becoming standard. research
- Multimodal synthesis: Vision + text + time-series in single generators (NVIDIA NeMo, academic labs). nvidia
- Edge case automation: Using LLMs to identify and generate missing scenarios without human annotation.
- Synthetic data as product: Companies like MOSTLY.AI and Gretel moving from one-off tools to managed platforms.
By 2026, synthetic data won't be "experimental." It'll be infrastructure—as standard as version control for code, but for data.
Conclusion: From Scarcity to Scale
Synthetic data solves the core bottleneck of modern AI: data scarcity paired with privacy constraints. You can now:
- Generate edge cases at $500/case instead of $50,000
- Reduce timelines from 12 months to weeks (ING: 10K records in 2 minutes)
- Ensure compliance with GDPR/HIPAA through differential privacy
- Share data safely with regulators, partners, research teams
The technology works. The tools are mature (SDV: 10M downloads, MOSTLY.AI: enterprise deployments). The frameworks exist (FEST, PrivEval, SafeSynthDP).
The remaining barriers aren't technical; they're organizational. Privacy teams worry about compliance (solved: differential privacy + DPIA). ML teams doubt quality (solved: evaluation frameworks show that a 90%+ match is achievable). Business leaders question ROI (solved: reported returns of 5.9-13x).
Your next step: Pick a pilot use case (fraud detection, rare disease research, test data generation). Run the proof-of-concept in Week 1-2 using the code above. By Week 3, you'll have hard data on privacy-utility trade-offs specific to your domain.
That's how enterprises move from "should we use synthetic data?" to "how do we scale it?"
Sources & Further Reading
- IBM Synthetic Data Generation (2024) ibm
- WEF AI Training Data Solutions (2025) weforum
- NVIDIA NeMo Data Designer nvidia
- InvisibleTech AI Training 2026 Report invisibletech
- EM360Tech GDPR Compliance (2025) em360tech
- CyberGarden 15 Synthetic Data Tools (2025) cybergarden
- Cogent Info Synthetic Data Cost Reduction (2025) linvelo
- University of Florida GDPR + Synthetic Data (2025) scholarship.law.ufl
- MOSTLY.AI vs. SDV Comparison (2025) mostly
- MOSTLY.AI Sequential Data Benchmarks (2025) mostly
- DataCebo SDV 10M Downloads (2025) datacebo
- KeyMakr ROI Metrics keymakr
- LinkedIn Synthetic Data Cost Analysis linkedin
- ING Belgium Case Study (2025) datacebo
- Deloitte Analytics Case Study syntheticus
- Technavio Market Analysis (2025) rootsanalysis
- NIST Differential Privacy github
- Google DP-SGD Research (2026) research
- Cyberarctica Healthcare Breach Costs cyberarctica
- GDPR Local Synthetic Data Compliance (2025) gdprlocal
- MOSTLY.AI Differential Privacy Feature mostly
- Nature Healthcare Data Review (2025) nature
- BlueGen Privacy-Utility Trade-off Framework (2025) bluegen
- SDV Evaluation Framework pypi
- FEST Framework Evaluating Synthetic Tabular Data arxiv
- KL Divergence CTGAN vs. TVAE Benchmark thesai
- Privacy-Preserving IoT IDS with DP-CTGAN pmc.ncbi.nlm.nih