Synthetic Data: How Mid-Market Companies Train AI Without Falling Foul of the GDPR
European companies face a paradox: AI needs data to learn, but the most valuable data — customer records, patient files, financial transactions — is locked behind GDPR, compliance departments, and regulatory requirements. Anonymization alone often falls short because re-identification remains realistic with small datasets. The consequence: AI projects are delayed by months or abandoned entirely.
Synthetic data solves this problem at the root. It replicates the statistical properties of real data — distributions, correlations, patterns — without containing a single real data point. Because synthetic datasets contain no personal data, they fall outside the scope of the GDPR. No consent management, no data processing agreements for training data, no data protection impact assessment for the dataset itself.
Why European Companies Specifically Need Synthetic Data
The GDPR as a Structural Bottleneck
The GDPR is not the only obstacle, but it is the most consequential. Every processing of personal data requires a legal basis under Art. 6 GDPR — including training a machine learning model. Obtaining consent from data subjects for training purposes is virtually impossible in practice: how do you explain to thousands of customers what exactly a neural network does with their data?
Additionally, Spain's data protection authority AEPD clarified in 2025 that generating synthetic data from personal source data is itself a processing activity under the GDPR. This means the generation process requires a legal basis — but the resulting synthetic dataset is freely usable. Understanding this two-step logic unlocks enormous room for action.
The technical and legal foundations for GDPR-compliant AI systems — including anonymization techniques, hosting options, and data processing agreements — are covered in our dedicated guide.
Small Datasets, Big Ambitions
Mid-market companies rarely have millions of data points. A mechanical engineering firm with 200 customers does not have a big-data foundation for a predictive maintenance model. A specialized laboratory with 5,000 test results per year has insufficient data for a robust anomaly detection model. Synthetic data can expand these datasets — not through duplication, but through statistically valid augmentation that specifically supplements underrepresented scenarios.
Regulated Industries Under Double Pressure
Healthcare, financial services, and manufacturing face industry-specific regulations on top of the GDPR: BaFin requirements in Germany, the Medical Devices Regulation, ISO standards for quality management. Synthetic data significantly reduces the compliance burden by decoupling regulatory complexity from the data layer. The AI model trains on synthetic data — the real data stays in the secured infrastructure.
Three Use Cases That Deliver Immediate Impact
Synthetic data is not a theoretical concept — it is deployed in production today. The following three use cases demonstrate where synthetic data provides the greatest leverage: machine learning model training, software testing, and market simulation without customer tracking.
1. ML Model Training Without Personal Data
The classic use case: a company wants to train a classification model — for customer segmentation, churn prediction, or fraud detection — but the training data contains names, addresses, account numbers, and transaction histories.
The synthetic approach: A generation model learns the statistical relationships in the original dataset — which features correlate, what distributions exist, which clusters emerge — and produces a new dataset with identical statistical properties but no real individuals. The trained ML model learns the same patterns without ever seeing a single real customer data point.
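To make this concrete, here is a minimal sketch using the open-source SDV library (covered in the tool section below). It assumes the real data sits in a single pandas DataFrame loaded from a CSV file; the file and column names are illustrative, and the API shown follows SDV 1.x and may differ between versions.

```python
# Minimal sketch, assuming SDV 1.x and a flat table of customer data.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

customers = pd.read_csv("customers.csv")  # real, GDPR-relevant source data

# Describe the table structure so the synthesizer knows column types
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(customers)

# Learn distributions and correlations from the original data
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(customers)

# Generate a fully synthetic dataset with the same statistical structure
synthetic_customers = synthesizer.sample(num_rows=len(customers))
synthetic_customers.to_csv("synthetic_customers.csv", index=False)
```

The downstream ML model then trains on `synthetic_customers.csv`, while the original file never leaves the secured environment.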
IBM provides synthetic datasets for credit card fraud, insurance claims, and anti-money laundering scenarios. Swiss insurer Die Mobiliar validated the approach for churn prediction and confirmed that synthetic data preserves model quality while ensuring data protection compliance.
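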
2. Software Testing with Realistic Data
Development teams need test data that reflects the complexity of real data: edge cases, unusual formatting, missing fields, extreme values. Static test fixtures rarely capture this complexity, and copying production data into test environments is generally a GDPR violation.
The synthetic approach: Generated test data replicates the distributions and anomalies of real production data — including rare edge cases that manually created test data would never cover. This improves test coverage and prevents production bugs that only surface with specific data constellations.
Particularly relevant for companies integrating AI systems into existing IT landscapes: synthetic test data enables realistic integration tests between ERP, CRM, and AI components without copying production data into test environments.
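As an illustration, a fitted synthesizer can serve fresh, realistic test data directly inside a test suite. The sketch below assumes a previously fitted and saved SDV synthesizer and a pytest setup; file, fixture, and column names are hypothetical.

```python
# Hedged sketch: synthetic test data via a pytest fixture, assuming an SDV
# synthesizer was fitted earlier and saved with its save() method.
import pytest
from sdv.single_table import GaussianCopulaSynthesizer

@pytest.fixture(scope="session")
def synthetic_orders():
    # Load the fitted model, not the production data -- no personal data
    # ever enters the test environment.
    synthesizer = GaussianCopulaSynthesizer.load("orders_synthesizer.pkl")
    return synthesizer.sample(num_rows=1_000)

def test_invoice_totals_are_non_negative(synthetic_orders):
    # Example integration-style assertion against realistic test data
    assert (synthetic_orders["invoice_total"] >= 0).all()
```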
3. Market Simulation Without Customer Tracking
How do customers react to a price change? What happens when a new product is launched in a specific segment? Traditionally, such analyses require extensive customer data and tracking.
The synthetic approach: From historical transaction data, a synthetic market model is generated that maps purchasing behaviour, price sensitivity, and segment dynamics. Scenarios can be explored without tracking a single real customer. This is not only GDPR-compliant — it also enables simulations for scenarios where no historical data exists yet.
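A deliberately simplified illustration of such a what-if analysis: the sketch below assumes a synthetic transaction table with `price` and `units_sold` columns and a constant price elasticity. Real market models are considerably richer; this only shows the mechanics of exploring a scenario on synthetic data rather than on tracked customers.

```python
# Toy what-if analysis on synthetic transactions. The elasticity value and
# column names are illustrative assumptions, not part of any specific tool.
import pandas as pd

synthetic_tx = pd.read_csv("synthetic_transactions.csv")

ELASTICITY = -1.2     # assumed constant price elasticity of demand
PRICE_CHANGE = 0.05   # simulate a 5% price increase

baseline_revenue = (synthetic_tx["price"] * synthetic_tx["units_sold"]).sum()

# Constant-elasticity demand response: pct change in quantity is roughly
# elasticity * pct change in price
new_units = synthetic_tx["units_sold"] * (1 + ELASTICITY * PRICE_CHANGE)
new_revenue = (synthetic_tx["price"] * (1 + PRICE_CHANGE) * new_units).sum()

print(f"Revenue change: {(new_revenue / baseline_revenue - 1):+.1%}")
```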
Real vs. Synthetic Data: When to Use Which
Real data, anonymized data, and synthetic data each have specific strengths and limitations. Not every use case requires synthetic data: the right choice depends on data sensitivity, data volume, GDPR risk, the required statistical fidelity, and the regulatory context.
| Criterion | Real Data | Anonymized Data | Synthetic Data |
|---|---|---|---|
| GDPR-relevant | Yes | Limited | No |
| Re-identification risk | Yes | Limited | No |
| Statistical fidelity | 100% | 85–95% | 90–98% |
| Data augmentation possible | No | No | Yes |
| Free sharing | No | Limited | Yes |
| Creation effort | None | Medium | Medium to high |
| Edge case generation | No | No | Yes |
| Suitable for ML training | Yes | Yes | Yes |
| Suitable for software testing | No | Limited | Yes |
What a Synthetic Data Pipeline Looks Like
Generating synthetic data is not a one-click operation. It follows a five-stage pipeline: data profiling, model selection, generation, validation, and deployment. Each stage has specific quality criteria that ensure the output is statistically valid, privacy-compliant, and fit for the target purpose.
Step 1: Data Profiling
Before synthetic data can be generated, the original dataset must be understood. Automated profiling analyses:
- Distributions of each column (numerical, categorical, temporal)
- Correlations between features (which fields are interdependent?)
- Data quality — missing values, outliers, inconsistencies
- Relational structures — foreign keys, one-to-many and many-to-many relationships between tables
Profiling tools like Great Expectations or ydata-profiling generate automated reports that serve as the baseline for subsequent validation. Companies already operating an AI data infrastructure can leverage their existing data quality frameworks directly.
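A minimal profiling sketch with ydata-profiling, assuming the source data is available as a CSV file (file names are illustrative):

```python
# Minimal sketch: automated profiling report as the baseline for validation.
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("customers.csv")

# Generates an HTML report with distributions, correlations and missing values
profile = ProfileReport(df, title="Customer data profile")
profile.to_file("customer_profile.html")
```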
Step 2: Model Selection and Configuration
Different generation models are used depending on the data type:
- Tabular data: Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), or Gaussian Copula Models
- Time series data: Sequential models (e.g. TimeGAN) that preserve temporal dependencies
- Relational data: Multi-table synthesizers that maintain foreign key relationships and referential integrity
- Text data: LLM-based generation with controlled variance and privacy constraints
The model choice directly affects statistical fidelity and privacy guarantees. More complex models deliver higher accuracy but require more compute and tuning.
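For tabular data, the choice often comes down to a copula baseline versus a deep generative model. The sketch below shows how that decision could be encoded with SDV 1.x synthesizer classes; the strategy names and epoch count are illustrative assumptions, and the comments reflect typical rather than guaranteed trade-offs.

```python
# Sketch: mapping a fidelity/compute trade-off to SDV 1.x synthesizer classes.
from sdv.single_table import (
    GaussianCopulaSynthesizer,  # fast, solid baseline for mostly numeric tables
    CTGANSynthesizer,           # GAN-based, usually higher fidelity, more compute
    TVAESynthesizer,            # VAE-based middle ground
)

def build_synthesizer(metadata, strategy="baseline"):
    if strategy == "baseline":
        return GaussianCopulaSynthesizer(metadata)
    if strategy == "high_fidelity":
        # More epochs tend to improve fidelity at the cost of training time
        return CTGANSynthesizer(metadata, epochs=500)
    return TVAESynthesizer(metadata)
```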
Step 3: Generation
The configured model produces the synthetic dataset. Critical parameters:
- Dataset size: Does not need to match the original — augmentation deliberately generates more data points for underrepresented classes (see the sketch after this list)
- Privacy budget: In differential privacy approaches, epsilon determines the trade-off between accuracy and privacy
- Seed control: Reproducibility of generation for audit purposes
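A short sketch of targeted augmentation via conditional sampling, assuming the synthesizer fitted in the earlier sketch and a hypothetical `churned` column; SDV 1.x exposes this through `Condition` and `sample_from_conditions`.

```python
# Sketch: generate extra rows for an underrepresented class only.
from sdv.sampling import Condition

# 5,000 additional synthetic rows for the rare "churned" class
rare_class = Condition(
    num_rows=5_000,
    column_values={"churned": True},
)
augmented_rows = synthesizer.sample_from_conditions(conditions=[rare_class])
```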
Step 4: Validation
The generated dataset is validated against the original, typically using a hold-out portion of the source data that was not used for generation:
- Statistical tests: Kolmogorov-Smirnov test for distributions, correlation matrix comparison
- ML utility test: A model is trained on synthetic data and evaluated on real test data (sketched after this list). The performance gap versus a model trained purely on real data is the key quality metric.
- Privacy tests: Nearest-neighbour distance between synthetic and real data points. Distances that are too small indicate memorization — a privacy risk.
- Plausibility check: Domain expertise validates whether the synthetic data represents realistic business scenarios.
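The sketch below combines the three automated checks, assuming `real` and `synthetic` are pandas DataFrames with matching columns; `age`, `monthly_revenue`, and `churned` are placeholder names, and acceptable thresholds would be project-specific.

```python
# Hedged validation sketch: distribution test, ML utility test, privacy check.
from scipy.stats import ks_2samp
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors

features = ["age", "monthly_revenue"]

# 1. Statistical test: compare marginal distributions column by column
for col in features:
    stat, p_value = ks_2samp(real[col], synthetic[col])
    print(f"KS test for {col}: statistic={stat:.3f}, p={p_value:.3f}")

# 2. ML utility: train on synthetic, test on real ("TSTR")
model = RandomForestClassifier(random_state=0)
model.fit(synthetic[features], synthetic["churned"])
auc = roc_auc_score(real["churned"], model.predict_proba(real[features])[:, 1])
print(f"TSTR ROC-AUC on real hold-out: {auc:.3f}")

# 3. Privacy: distance of each synthetic row to its nearest real neighbour;
#    very small minimum distances hint at memorization.
nn = NearestNeighbors(n_neighbors=1).fit(real[features])
distances, _ = nn.kneighbors(synthetic[features])
print(f"Minimum nearest-neighbour distance: {distances.min():.4f}")
```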
Step 5: Deployment and Monitoring
Synthetic datasets are stored with versioning and documentation — including metadata on the generation model, configuration, validation results, and intended use. In production, synthetic data is regularly regenerated when distributions in the original dataset shift (data drift).
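One lightweight convention for this is a manifest file written next to each generated dataset. The sketch below assumes the `synthetic_customers` DataFrame from the earlier sketches; the field names follow no particular standard and are illustrative.

```python
# Sketch: versioned storage of a synthetic dataset plus generation metadata.
import json
from datetime import datetime, timezone

version = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
synthetic_customers.to_csv(f"synthetic_customers_{version}.csv", index=False)

manifest = {
    "version": version,
    "generator": "GaussianCopulaSynthesizer (SDV 1.x)",
    "source_profile": "customer_profile.html",
    "validation": {"tstr_auc": None, "min_nn_distance": None},  # fill in from Step 4
    "intended_use": "churn model training",
}
with open(f"synthetic_customers_{version}.manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```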
Legal basis required
Generating synthetic data from personal source data is itself a processing activity under the GDPR. The generation process requires a legal basis — typically legitimate interest (Art. 6(1)(f)). The resulting synthetic dataset, however, is freely usable as long as re-identification is not possible.
Tool Landscape: Three Approaches Compared
The market for synthetic data generation consolidated in 2025/2026. Three categories dominate: enterprise platforms like MOSTLY AI for regulated industries, developer-friendly API platforms like Gretel for data science teams, and the open-source library SDV for teams with Python expertise and full control requirements.
MOSTLY AI — Enterprise Platform for Regulated Industries
MOSTLY AI specializes in financial services and healthcare. The platform transforms production data into privacy-safe synthetic versions through a six-step process with automated model training. Strength: high accuracy (97.8% in benchmarks), integrated privacy tests, EU hosting option.
Gretel — Developer-Friendly API Platform
Gretel targets data science teams that want to embed synthetic data generation into existing engineering workflows. API-first approach with support for tabular, text, and image data. Strength: flexible pipeline integration, hybrid deployment, broad data type support.
SDV — Open-Source Library
Synthetic Data Vault (SDV) is a Python library for tabular, relational, and time series data. Open source, free, full control. Strength: no vendor lock-in, locally deployable, ideal for teams with data science expertise. Weakness: requires more manual setup and tuning than managed platforms.
Advantages
- GDPR-compliant when implemented correctly
- Unlimited dataset size possible
- Free sharing with partners and service providers
- Edge case generation for more robust models
- No consent required for training data
Disadvantages
- Generation process itself is GDPR-relevant
- Quality depends heavily on source data
- Validation effort is non-trivial
- Complex relational structures are challenging
- Domain expertise needed for model selection and tuning
Decision Checklist: Does Synthetic Data Fit Your Project?
Synthetic data is not a silver bullet, but for many companies it is the most efficient path to GDPR-compliant AI training. Not every AI project needs it. The following seven criteria help assess whether a proof-of-concept makes sense for your project:
- Personal data in training? → If yes, synthetic data is a strong alternative to anonymization.
- Dataset too small for robust training? → Synthetic augmentation can specifically expand underrepresented classes.
- Realistic test data needed? → Synthetic test data replicates production complexity without compliance risk.
- Sharing data with third parties? → Synthetic datasets can be shared freely — ideal for partner integrations and outsourcing.
- Regulated industry? → Synthetic data significantly reduces compliance effort for audit requirements and industry-specific regulation.
- Sufficient data quality in the original? → Synthetic data is only as good as its source. Poor original data produces poor synthetic data.
- Data science competence in-house? → Managed platforms (MOSTLY AI, Gretel) lower the entry barrier. SDV requires Python expertise.
FAQ: Synthetic Data for Mid-Market Companies
Is synthetic data really GDPR-compliant?
Yes — the resulting synthetic dataset contains no personal data and therefore falls outside the scope of the GDPR. However, the generation process, where the model learns from real personal data, is itself a data processing activity and requires a legal basis. Typically, legitimate interest (Art. 6(1)(f) GDPR) with a documented balancing test applies here. Details on GDPR legal bases for AI applications can be found in our guide to GDPR-compliant AI.
How does the quality of synthetic data compare to real data?
Modern generation tools achieve 90–98% statistical fidelity compared to the original data. In ML utility tests — where a model is trained on synthetic data and evaluated on real data — the performance loss is typically 2–5 percentage points. For many use cases, this is acceptable, especially when the alternative is no training at all.
Can synthetic data enable re-identification?
Not when implemented correctly. Privacy tests (nearest-neighbour distance, membership inference attacks) validate that no synthetic data point is too close to a real one. Differential privacy methods provide mathematical guarantees. However, poorly configured generation can lead to memorization — which is why the validation phase is non-negotiable.
What does it cost to get started with synthetic data?
Open-source tools like SDV are free. Managed platforms start at approximately EUR 500–2,000 per month depending on data volume and features. The largest cost factor is not the tool but the expertise: data profiling, model configuration, and validation require data engineering competence. A proof-of-concept with a single dataset is achievable in 2–4 weeks.
How does synthetic data relate to the EU AI Act?
The EU AI Act requires documented training data with demonstrable quality and freedom from bias for high-risk AI systems. Synthetic data can help here: it enables targeted bias correction through controlled generation and complete documentation of the data generation process — both requirements of the AI Act.
Conclusion: Synthetic Data as a Strategic Enabler
The key takeaway: synthetic data is no longer a niche topic. It is a strategic lever for any company that wants to implement AI projects in a GDPR-compliant, scalable way — even with limited datasets. The technology is mature, the tools are available, and the legal framework is clear.
The critical step is not tool selection — it is the clean implementation of the pipeline: from profiling through generation to validation. Mastering this process unlocks AI applications that would be impossible or illegal without synthetic data.
Want to build synthetic data pipelines for your AI project? At IJONIS in Hamburg, we guide companies from feasibility study to production-ready pipeline. Get in touch — we analyse your data landscape and develop a proof-of-concept that shows whether synthetic data delivers the breakthrough for your use case.
How ready is your company for AI? Find out in 3 minutes with our free, AI-powered readiness check. Start the check now →


