AI & Insurtech

Synthetic Data for Insurance Model Training: Privacy, Regulatory Acceptance, and Use Cases in India

Indian insurers are turning to synthetic data generation to train underwriting, pricing, and fraud models without breaching the DPDP Act 2023. This guide covers GANs, diffusion models, tabular synthesis tools, differential privacy accounting, and IRDAI expectations for model validation.

Sarvada Editorial Team · Insurance Intelligence
16 min read
synthetic-data · data-privacy · dpdp-act · model-training · differential-privacy · ai-insurtech · underwriting-models · fraud-detection · tabular-synthesis · irdai

Last reviewed: April 2026

Why Indian Insurers Are Investing in Synthetic Data Generation in 2026

The operational pressure driving synthetic data adoption in Indian insurance has two sources that converged sharply in late 2025. The first is the Digital Personal Data Protection Act 2023, which came into effective enforcement through the Data Protection Board in early 2025. The Act requires that personal data be processed only for specific purposes with explicit consent, that processing be proportionate, and that data not be retained beyond the period necessary. Training a machine learning model on a dataset containing policyholder names, addresses, medical histories, or claim details is a form of processing that triggers every one of those obligations. Insurers that historically treated policy and claims archives as a fungible training resource now face a legal question before every model training run.

The second pressure is the scarcity of data for certain risks. Indian commercial underwriters modelling cyber risk, large marine hull losses, product recall events, or infectious disease business interruption face the fundamental problem that these events are rare. A single large insurer might have fewer than 50 genuine cyber incident claims in its entire history, and fewer than 10 with complete forensic detail. Training a credible pricing or reserving model on this data alone is statistically untenable. Pooling across insurers through the Insurance Information Bureau helps, but coverage of specialty lines in the IIB is thin and lags by 12 to 24 months.

Synthetic data is being positioned as a response to both pressures simultaneously. A properly generated synthetic dataset preserves the statistical properties of the real data (distributions, correlations, conditional dependencies) while containing no records that correspond to any real policyholder. For privacy, this removes the DPDP Act 2023 burden, because data that corresponds to no identifiable individual is not personal data under the Act. For data scarcity, synthetic data can extend a small real dataset by generating plausible additional records that fill in the tails of the distribution. The combination has moved synthetic data from an academic curiosity in 2023 to an active investment area for every top-10 Indian general insurer in 2026.

The size of this investment is material. Bajaj Allianz, ICICI Lombard, HDFC Ergo, and Tata AIG have collectively committed an estimated INR 80 to 120 crore over 2025 and 2026 to synthetic data infrastructure, vendor partnerships, and internal capability building. The state-owned insurers are moving more cautiously, with LIC and New India Assurance evaluating pilot programmes but not yet committing to production deployment.

Techniques: GANs, Diffusion Models, Agent-Based Simulation, and Tabular Synthesis

Synthetic data generation is not a single technology. Four families of techniques are in active use in Indian insurance, each suited to different data types and use cases.

Generative Adversarial Networks (GANs) were the dominant technique in 2022 and 2023 and remain relevant for image and document synthesis. A GAN consists of two neural networks: a generator that produces synthetic samples and a discriminator that tries to distinguish real from synthetic. The two networks train against each other until the generator produces samples that fool the discriminator. For insurance, GANs are used to generate synthetic document images (synthetic policy schedules, synthetic claim forms) for training optical character recognition and document classification models. The limitation of GANs is training instability, particularly on high-dimensional tabular data with mixed continuous and categorical columns.

Diffusion models have largely overtaken GANs for high-fidelity image and sequential data generation. A diffusion model learns to progressively denoise random noise into realistic samples. For insurance, diffusion models are being used to generate synthetic medical images for health insurance fraud detection (training models to identify fabricated diagnostic reports) and synthetic vehicle damage images for motor claims. The computational cost is higher than GANs but the output quality and training stability are materially better.

Tabular synthesis tools are the workhorses for policy and claims data, which is the largest and most commercially valuable data category in insurance. CTGAN (Conditional Tabular GAN) and TVAE (Tabular Variational Autoencoder) are the best-known algorithms, both available through the open-source Synthetic Data Vault (SDV) library. These tools handle the peculiarities of insurance tabular data: mixed continuous and categorical columns, heavy-tailed distributions on claim amounts, conditional dependencies between fields (claim amount depends on peril, which depends on line of business), and missing values. Indian insurers deploying synthetic data for underwriting model training typically start with SDV or commercial equivalents before considering bespoke architectures.
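CTGAN and TVAE require a deep learning stack, but the conditional structure these tools must preserve can be illustrated with a toy standard-library sampler. Everything below (field names, perils, probabilities, severity parameters) is an illustrative assumption, not a fitted model:

```python
import random

# Toy conditional tabular sampler: peril depends on line of business,
# claim amount depends on peril. All values here are invented.
PERILS_BY_LOB = {
    "motor": [("collision", 0.7), ("theft", 0.3)],
    "property": [("fire", 0.5), ("flood", 0.5)],
}
# Heavy-tailed severities: lognormal (mu, sigma) per peril, illustrative.
SEVERITY = {"collision": (10.5, 0.8), "theft": (11.0, 1.0),
            "fire": (12.5, 1.4), "flood": (12.0, 1.6)}

def sample_record(rng: random.Random) -> dict:
    lob = rng.choice(list(PERILS_BY_LOB))
    perils, weights = zip(*PERILS_BY_LOB[lob])
    peril = rng.choices(perils, weights=weights, k=1)[0]
    mu, sigma = SEVERITY[peril]
    return {"line_of_business": lob, "peril": peril,
            "claim_amount": rng.lognormvariate(mu, sigma)}

rng = random.Random(42)
synthetic = [sample_record(rng) for _ in range(1000)]
```

A real tabular synthesiser learns these conditional distributions from data rather than hard-coding them; the point is that claim amount must be sampled conditional on peril, and peril conditional on line of business, or the joint structure the text describes is lost.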

Agent-based simulation is a fundamentally different approach that does not learn from existing data but constructs synthetic data from first principles. The modeller specifies agents (policyholders, perils, loss mechanisms) and the rules of their interaction. The simulation generates synthetic claim histories consistent with the specified behaviour. Agent-based simulation is particularly useful for modelling systemic events (pandemic spread, cyber incident propagation, catastrophe clustering) where no historical data adequately captures the tail. Indian reinsurers including GIC Re have used agent-based catastrophe simulation for decades; the novelty is applying similar techniques to synthesise underwriting training data at scale.
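A minimal sketch of the first-principles approach, assuming a hypothetical vendor-dependency model of cyber incident propagation (the agents, rates, and severity parameters are all invented for illustration):

```python
import random

# Toy agent-based cyber simulation: each insured SME depends on one
# vendor; a vendor breach propagates to dependents with some probability.
# All parameters are illustrative assumptions, not calibrated values.
def simulate_cyber_year(n_insureds: int, n_vendors: int,
                        vendor_breach_rate: float,
                        propagation_prob: float,
                        rng: random.Random) -> list:
    vendor_of = [rng.randrange(n_vendors) for _ in range(n_insureds)]
    claims = []
    for v in range(n_vendors):
        if rng.random() < vendor_breach_rate:  # vendor-level incident
            for i, dep in enumerate(vendor_of):
                if dep == v and rng.random() < propagation_prob:
                    # Heavy-tailed business interruption loss
                    claims.append({"insured": i, "vendor": v,
                                   "loss": rng.lognormvariate(13.0, 1.5)})
    return claims

rng = random.Random(7)
# 5,000 SMEs, 40 vendors: one simulated accident year of clustered claims
year = simulate_cyber_year(5000, 40, 0.05, 0.3, rng)
```

The output exhibits exactly the clustering that historical data rarely captures: claims arrive in correlated batches tied to a common vendor, which is the tail behaviour a cyber model needs to see.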

The choice of technique depends on the data type, the use case, and the privacy budget. Image data calls for GANs or diffusion models. Tabular policy and claims data calls for CTGAN, TVAE, or commercial tabular synthesis. Rare systemic events call for agent-based simulation, potentially combined with tabular synthesis for the base portfolio. Most production deployments in Indian insurance combine at least two techniques in a single pipeline.

DPDP Act 2023 and the Privacy Advantage of Synthetic Data in India

The DPDP Act 2023 defines personal data as 'any data about an individual who is identifiable by or in relation to such data.' Synthetic data, if properly generated, does not correspond to any real individual and therefore falls outside the Act's scope. This is the core privacy advantage, and it is the foundation of the business case for synthetic data in Indian insurance.

But the qualifier 'properly generated' is doing heavy work. A synthetic dataset that inadvertently reproduces records from the training data, or that can be reverse-engineered to identify individuals in the training data, is not truly anonymous and may still be subject to the DPDP Act. Indian insurers are developing internal protocols to verify the privacy properties of synthetic data before deployment.

Three verification layers are becoming standard. The first is distance-based similarity checking: no synthetic record may be within a minimum distance of any training record in the feature space. If a generated synthetic policyholder has exactly the same age, location, industry, and prior claims pattern as a real policyholder, the synthetic record is rejected. The second is membership inference testing: can an adversary, given access to the synthetic dataset, determine with better-than-random accuracy whether a specific individual was in the training data? Standard membership inference attacks (shadow models, loss-based attacks) are run against the synthetic output, and acceptable privacy requires attack accuracy close to random. The third is differential privacy accounting: if the synthesis algorithm provides formal differential privacy guarantees, the privacy budget (epsilon and delta parameters) is tracked and limited.
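The first layer can be sketched in a few lines. This brute-force version assumes numeric features already scaled to comparable ranges; a production pipeline would use an approximate nearest-neighbour index rather than an exhaustive scan:

```python
import math

def min_distance_filter(synthetic, training, min_dist):
    """Reject any synthetic record closer than min_dist (Euclidean, on
    pre-scaled numeric features) to its nearest training record."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    accepted, rejected = [], []
    for s in synthetic:
        nearest = min(dist(s, t) for t in training)
        (accepted if nearest >= min_dist else rejected).append(s)
    return accepted, rejected

# Illustrative records: features already scaled to [0, 1]
training = [(0.30, 0.70, 0.10), (0.80, 0.20, 0.50)]
synthetic = [(0.30, 0.70, 0.10),   # exact copy of a training record
             (0.55, 0.45, 0.90)]   # genuinely novel point
ok, dropped = min_distance_filter(synthetic, training, min_dist=0.05)
```

The exact copy is rejected at distance zero while the novel point passes; the threshold itself is a policy choice that depends on feature scaling and the sensitivity of the data.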

Differential privacy deserves specific attention because it is the only mathematical framework that provides formal privacy guarantees. Differentially private synthesis algorithms add calibrated noise during training such that the synthetic output is provably insensitive to any single record in the training data. The tradeoff is utility: stronger privacy (lower epsilon) produces synthetic data with worse statistical fidelity to the real data. Indian insurers typically operate in a range of epsilon between 1 and 10 depending on the sensitivity of the underlying data. Health insurance data, which includes medical history, tends toward the lower end. Commercial property data, which is less personally sensitive, tolerates the higher end.
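In its simplest form, the budget tracking reduces to sequential composition: the epsilons (and deltas) of successive releases add, and the total must stay within the allotted budget. A minimal accountant sketch, with illustrative budget and spend values:

```python
class PrivacyAccountant:
    """Tracks cumulative privacy spend under basic sequential composition.
    Tighter accountants (advanced composition, RDP) yield smaller totals;
    this is the conservative baseline."""

    def __init__(self, epsilon_budget: float, delta_budget: float):
        self.epsilon_budget = epsilon_budget
        self.delta_budget = delta_budget
        self.epsilon_spent = 0.0
        self.delta_spent = 0.0

    def spend(self, epsilon: float, delta: float = 0.0) -> None:
        # Refuse any release that would exceed the remaining budget
        if (self.epsilon_spent + epsilon > self.epsilon_budget
                or self.delta_spent + delta > self.delta_budget):
            raise RuntimeError("privacy budget exhausted: stop or renegotiate")
        self.epsilon_spent += epsilon
        self.delta_spent += delta

# Health data at the strict end of the range: total epsilon = 1.0
acct = PrivacyAccountant(epsilon_budget=1.0, delta_budget=1e-5)
acct.spend(0.4, 1e-6)   # first synthetic release
acct.spend(0.4, 1e-6)   # second release; a third 0.4 would be refused
```

The operational point is that the budget is a finite resource across all releases from the same training data, not a per-release parameter.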

The DPDP Act also recognises anonymisation as a legitimate data handling practice, and the Data Protection Board's draft rules published in late 2024 explicitly mention synthetic data as one form of anonymisation. This regulatory recognition has accelerated insurer adoption by removing legal uncertainty about whether synthetic data qualifies as non-personal data under the Act. Legal opinions sought by major insurers in 2025 have consistently confirmed that properly generated, verified synthetic data is outside the DPDP Act's scope, though the insurer retains responsibility for ensuring the synthesis process does not leak personal data.

Use Cases: Rare-Event Modelling, Fraud Training, SME Underwriting, and Cross-Border Data Sharing

Four use cases dominate synthetic data deployment in Indian insurance in 2026.

Rare-event modelling addresses the problem of insufficient historical data for low-frequency high-severity events. Cyber incident claims, large marine hull losses, product recall events, directors and officers claims, and specialty liability claims all share the same statistical problem: not enough historical observations to train a credible model. Synthetic data generated through agent-based simulation or through conditional tabular synthesis with rare-event boosting extends the dataset. A typical cyber pricing model might train on 50 real claims supplemented by 2,000 synthetic claims that preserve the statistical properties of the real incidents while filling in the tail of the distribution. The model's generalisation performance on unseen real claims improves measurably, though care is required to avoid the model overfitting to synthesis artefacts.

Fraud pattern training is the second major use case. Labelled fraud data is inherently scarce because confirmed fraud represents a tiny fraction of total claims. Synthetic fraud data, generated to include the specific relational patterns and anomalies that characterise known fraud schemes, allows fraud detection models to train on a more balanced dataset. The technique is particularly effective for organised fraud ring detection where the statistical signature of a ring (specific patterns of connections between entities) can be synthesised at scale. Graph synthesis techniques, a specialised form of synthetic data generation for networked data, are being piloted by Go Digit and ICICI Lombard for this use case.

SME underwriting is the third major use case and has particular commercial significance. The Indian SME commercial insurance market has historically been underserved because traditional underwriting requires rich data (audited financials, detailed loss history, site inspections) that small enterprises cannot economically provide. Data-scarce underwriting relies on models that can infer risk from thin data. Training such models requires diverse examples of SME profiles, which no single insurer possesses. Synthetic SME portfolios, generated to reflect the distribution of Indian SMEs across industries, revenue bands, and geographic regions, allow insurers to train thin-data models without overfitting to their own historical clientele.

Cross-border data sharing is the fourth use case and is driven by reinsurance relationships. Indian insurers regularly share portfolio data with international reinsurers for treaty placement and facultative reinsurance. The DPDP Act 2023 restricts cross-border data transfers to notified jurisdictions, which complicates the traditional practice of sharing raw policy-level data. Synthetic datasets that preserve portfolio characteristics without containing personal data can be freely shared, enabling reinsurance negotiations without the legal overhead of cross-border transfer compliance. Munich Re India, Swiss Re, SCOR, and Hannover Re have all indicated acceptance of synthetic portfolio data for preliminary placement discussions, with real data required only for final binding.

Validation Challenges: Fidelity versus Privacy, Re-identification Risk, and Statistical Testing

The central engineering problem in synthetic data is the tradeoff between fidelity (how closely the synthetic data matches the real data's statistical properties) and privacy (how thoroughly the synthesis removes information about individual records). Perfect fidelity would reproduce the training data exactly, which is zero privacy. Perfect privacy would produce random noise, which is zero fidelity. The engineering challenge is finding the right operating point for each use case.

Fidelity is measured through multiple dimensions. Marginal distribution matching checks that each column in the synthetic data has a similar distribution to the corresponding column in the real data. Joint distribution matching checks that pairs or groups of columns have similar correlations. Conditional distribution matching checks that the distribution of one column given values of another is preserved. Downstream task performance, the ultimate fidelity measure, checks whether a model trained on synthetic data performs similarly to a model trained on real data when both are evaluated on a held-out real test set. Indian insurers deploying synthetic data for production model training typically require downstream performance within 5 to 10% of real-data performance before accepting the synthetic dataset.
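Marginal distribution matching is often operationalised with the two-sample Kolmogorov-Smirnov statistic, which can be computed directly. The sample values and the acceptance threshold below are illustrative assumptions:

```python
import bisect

def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical CDFs (0 = identical, approaching 1 = disjoint)."""
    real, synth = sorted(real), sorted(synth)
    grid = sorted(set(real) | set(synth))
    def ecdf(sample, x):
        return bisect.bisect_right(sample, x) / len(sample)
    return max(abs(ecdf(real, x) - ecdf(synth, x)) for x in grid)

# Illustrative per-column gate: accept if KS < 0.2 (assumed threshold),
# in addition to the downstream-task comparison described above.
real_amounts = [1200, 1500, 2100, 3400, 9800, 54000]
synth_amounts = [1100, 1600, 2000, 3600, 10100, 48000]
ks = ks_statistic(real_amounts, synth_amounts)
accept_column = ks < 0.2
```

A marginal test like this is necessary but not sufficient: joint and conditional structure, and ultimately downstream task performance, must pass their own gates as the paragraph above describes.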

Privacy is measured through the verification layers described earlier: distance-based similarity, membership inference resistance, and differential privacy accounting. Re-identification risk is a specific concern for insurance because certain combinations of attributes (industry, location, revenue, prior claim history) can uniquely identify a commercial policyholder even without explicit identifiers. The k-anonymity principle requires that every combination of quasi-identifiers be shared by at least k records in the synthetic dataset. L-diversity extends k-anonymity to require diversity in sensitive attributes within each equivalence class. These classical anonymisation principles are applied as additional checks on top of the generative synthesis process.
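A k-anonymity check over quasi-identifiers is a straightforward group-and-count. The field names and the choice of k below are illustrative:

```python
from collections import Counter

def k_anonymity_violations(records, quasi_identifiers, k):
    """Return quasi-identifier combinations shared by fewer than k
    records in the synthetic dataset, with their counts."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers)
                     for r in records)
    return {combo: n for combo, n in groups.items() if n < k}

# Illustrative synthetic commercial records
records = [
    {"industry": "textiles", "district": "Surat", "revenue_band": "1-5cr"},
    {"industry": "textiles", "district": "Surat", "revenue_band": "1-5cr"},
    {"industry": "textiles", "district": "Surat", "revenue_band": "1-5cr"},
    {"industry": "pharma", "district": "Hyderabad", "revenue_band": "5-25cr"},
]
violations = k_anonymity_violations(
    records, ("industry", "district", "revenue_band"), k=3)
```

The lone pharma record is flagged: a combination that appears fewer than k times is a re-identification risk even though the record is synthetic, because a matching real policyholder may exist.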

Statistical testing presents a challenge because insurance data has characteristics that standard tests handle poorly. Heavy-tailed claim distributions, sparse categorical values (rare industry codes, uncommon peril combinations), conditional dependencies across many columns, and temporal autocorrelation in loss patterns all strain the standard fidelity metrics. Insurers deploying synthetic data in production typically supplement standard statistical tests with bespoke insurance-specific validations: does the synthetic claim frequency by peril match the real portfolio? Does the synthetic loss ratio by industry match? Does the synthetic claim size distribution preserve the catastrophe tail?

The validation effort is not a one-time project. Synthetic data must be revalidated when the underlying real data distribution shifts, which happens with changes in the insurer's book composition, with market cycle movements, or with regulatory changes affecting claim patterns. Insurers are building continuous validation pipelines that monitor synthetic data quality over time and trigger regeneration when drift exceeds thresholds.
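One common drift trigger for such a pipeline is the population stability index (PSI) between the binned distribution the synthesiser was fitted on and the current book. The conventional 0.25 threshold and the bin counts below are assumptions for illustration:

```python
import math

def population_stability_index(expected_counts, actual_counts):
    """PSI between the binned distribution at synthesis time (expected)
    and the current real data (actual). Rule of thumb: PSI > 0.25
    signals material drift; 0.1-0.25 warrants investigation."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Small floor avoids log(0) on empty bins
        e_pct = max(e / e_total, 1e-6)
        a_pct = max(a / a_total, 1e-6)
        psi += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return psi

# Claim counts by peril at fit time vs this quarter (illustrative)
at_fit = [500, 300, 150, 50]
current = [480, 310, 160, 50]
drift = population_stability_index(at_fit, current)
regenerate = drift > 0.25   # threshold is a policy choice
```

Here the book has shifted only slightly, so no regeneration fires; a change in portfolio mix or a market cycle movement would push the index past the threshold and trigger a fresh synthesis run.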

IRDAI, CSO-Actuary Validation, and Reinsurer Acceptance of Synthetic Data Models

Regulatory and professional acceptance of synthetic-data-trained models is a more nuanced question than the technology's privacy benefits might suggest. IRDAI has not issued specific guidance on synthetic data use in insurance modelling as of April 2026, but its 2025 communications on AI and ML in insurance modelling include references to the acceptability of synthetic data for model training, provided the insurer can demonstrate the synthetic data's statistical fidelity and privacy properties.

The Institute of Actuaries of India's Actuarial Practice Standards require that actuarial work rely on appropriate, reliable, and relevant data. The Chief Statutory Actuary (CSO-actuary) in each Indian general insurer is responsible for certifying that the models used for pricing, reserving, and capital calculations are actuarially sound. Synthetic data training creates a specific question for the CSO-actuary: is a model trained on synthetic data reliable enough for actuarial certification?

The emerging professional position is subtle. Synthetic data supplementation (adding synthetic records to a real dataset to improve model training) is broadly accepted provided the synthetic generation process is documented and the downstream model is validated on real held-out data. Pure synthetic training (a model trained entirely on synthetic data) is viewed with more caution, typically acceptable only for initial prototyping or for use cases where real data is genuinely unavailable (new lines of business, rare perils, prospective scenarios).

Reinsurer acceptance varies by reinsurer and by transaction type. For treaty placement negotiations, major reinsurers accept synthetic portfolio data as a basis for preliminary discussion, but require real data (under appropriate data sharing agreements) for final binding. For facultative placements of unusual risks, synthetic data may be acceptable throughout the process if it is clearly documented and both parties accept the modelling approach. For catastrophe bond structuring and insurance-linked securities, investor due diligence typically requires real data, limiting the role of synthetic data to the earlier stages of structuring.

The validation expectations are converging around a common set of artefacts. The insurer provides the CSO-actuary and, where relevant, the reinsurer with documentation of the synthesis process (technique, hyperparameters, privacy mechanism), fidelity validation results (statistical tests, downstream task performance), privacy validation results (similarity distance, membership inference, differential privacy parameters if applicable), and an assessment of residual risks. This documentation becomes part of the model validation file and supports the actuarial certification of the downstream model.

Vendor Market: Global Platforms, Indian Open-Source, and Insurance-Specific Tools

The synthetic data vendor market serving Indian insurance combines global platforms, open-source tools, and insurance-specific offerings.

Global platforms lead in breadth of capability. MOSTLY AI, an Austrian company with a dedicated India go-to-market since 2023, offers tabular synthesis with strong privacy guarantees and has secured deployments with at least two Indian general insurers. Gretel.ai provides a cloud-based synthesis platform with APIs for integration into insurer data pipelines and is used in pilot deployments at several insurers. Tonic.ai focuses on relational database synthesis and is deployed for test data generation supporting software development. Synthesized provides enterprise synthesis with governance features. All these platforms operate on a SaaS model with data processed in the vendor's cloud infrastructure, which creates data residency considerations that Indian insurers manage through customer-managed keys, in-region processing, and specific contractual commitments.

Open-source tools are widely used for exploratory work and for deployments where cloud SaaS is not acceptable. The Synthetic Data Vault (SDV) library, developed originally at MIT and maintained by DataCebo, is the de facto standard for tabular synthesis in Python. It provides implementations of CTGAN, TVAE, and several classical algorithms, along with the metadata specification needed to handle real-world tabular data. PATE-GAN and other differentially private variants of these algorithms provide formal privacy guarantees. YData Synthetic is another widely used open-source library. The primary cost of open-source adoption is the engineering effort to operationalise and maintain the synthesis pipeline; the tools themselves are free.

Indian initiatives are emerging alongside global platforms. Sarvada Intelligence, IIT Madras's data science research group, and several insurtech startups are developing synthesis tools calibrated to Indian insurance data characteristics. Indian policy and claims data has specific quirks (PIN code hierarchies, GST industry classifications, IRDAI peril codes) that global tools do not natively handle. Indian-built tools can provide better out-of-the-box performance for these data structures, though they typically lack the feature depth of mature global platforms.

Insurance-specific synthesis tools combine synthesis with insurance domain logic. These tools understand policy structures, peril taxonomies, claim lifecycle states, and reinsurance treaty structures, and they produce synthetic data that is not just statistically plausible but also insurance-valid. The category is still young, with a handful of offerings from global insurtech vendors and a growing set of Indian players. Early adoption is concentrated in use cases where insurance validity matters most: reinsurance placement data, catastrophe modelling inputs, and regulatory reporting simulations.

Cost-Benefit: Synthetic Data versus Real Data Acquisition and Operational Economics

The cost-benefit calculation for synthetic data in Indian insurance compares three alternatives: acquiring more real data, using existing real data with privacy constraints, and generating synthetic data.

Acquiring more real data is the traditional path. It includes purchasing data from brokers and aggregators, participating in the IIB and sector-specific data pools, and sharing data with reinsurers for pooled studies. Costs vary widely: IIB participation is a nominal regulatory requirement, data purchases from commercial providers range from INR 10 lakh to INR 5 crore per dataset depending on scope, and bespoke data collection (field surveys, risk inspections) can cost INR 50 lakh to INR 2 crore per campaign. The benefit is that the data is real and carries no fidelity questions. The drawbacks are the cost, the time required (typically 6 to 18 months for meaningful data acquisition), and the DPDP Act 2023 obligations that attach to personal data.

Using existing real data with privacy constraints is the cautious path. It requires legal review of every modelling use case, explicit consent management, data minimisation to the fields necessary for each model, access controls limiting who can view the data, and retention policies that force deletion after the purpose is fulfilled. The operational cost includes dedicated data privacy officers, legal counsel, and technical infrastructure (consent management platforms, data catalogues, retention engines). Annual operating costs for a mature data privacy programme at a mid-sized Indian insurer run INR 5 to 15 crore. The benefit is full fidelity. The drawbacks are the process overhead that slows modelling velocity and the residual regulatory risk from edge cases.

Generating synthetic data has its own cost structure. Platform licences (for commercial tools like MOSTLY AI or Gretel) run INR 1 to 5 crore annually. Open-source deployments carry their cost in engineering time, typically INR 50 lakh to INR 2 crore for initial stand-up and ongoing maintenance. Compute costs for synthesis are moderate, perhaps INR 10 to 50 lakh annually for a mid-sized insurer's use cases. Validation costs, including statistical testing and privacy verification, add another INR 30 to 80 lakh annually in engineering and actuarial time. Total annual cost for a serious synthetic data programme runs INR 3 to 10 crore. The benefit is that the resulting synthetic data can be used freely within the insurer and shared externally without DPDP Act burden, and it can extend rare-event datasets beyond what real data permits. The drawbacks are the fidelity tradeoff, the need for careful validation, and the dependency on getting the synthesis process right.

In practice, Indian insurers are not choosing one of these paths exclusively. The mature approach is a portfolio: real data for model training where personal data is unavoidable and privacy overhead is acceptable, synthetic data for rare-event extension and for cross-functional or cross-insurer use cases, and careful data minimisation throughout. The economics favour increased investment in synthetic data as the technology matures, as regulatory clarity improves, and as the DPDP Act enforcement tightens.

Frequently Asked Questions

Is synthetic data considered personal data under the DPDP Act 2023?
Properly generated synthetic data is not personal data under the DPDP Act 2023. The Act defines personal data as data about an identifiable individual. Synthetic data that does not correspond to any real individual and that cannot be reverse-engineered to identify individuals in the training data falls outside the Act's scope. The Data Protection Board's draft rules published in late 2024 explicitly mention synthetic data as a form of anonymisation. However, the insurer remains responsible for verifying that the synthesis process does not leak personal data, through similarity distance checks against training records, membership inference testing, and where applicable differential privacy accounting with documented epsilon and delta parameters.
Which synthetic data technique should Indian insurers use for underwriting model training?
Tabular synthesis tools such as CTGAN and TVAE, available through the open-source Synthetic Data Vault library, are the starting point for most policy and claims data synthesis. These handle the mixed continuous and categorical columns typical of insurance tabular data and preserve conditional dependencies across fields. For rare-event modelling where historical data is insufficient, agent-based simulation supplements tabular synthesis by generating plausible tail events from first principles. For image and document data such as synthetic policy schedules or diagnostic reports used in fraud detection, diffusion models have largely replaced GANs due to better training stability and output quality. Most production deployments combine at least two techniques in a single pipeline.
How do Indian reinsurers view synthetic portfolio data for treaty placement?
Major reinsurers including Munich Re India, Swiss Re, SCOR, and Hannover Re accept synthetic portfolio data for preliminary treaty placement discussions, particularly where cross-border data transfers under the DPDP Act 2023 would otherwise add legal complexity. Final binding typically still requires real data under appropriate data sharing agreements. For facultative placements of unusual risks, synthetic data may be acceptable throughout the process if the methodology is documented and both parties accept the modelling approach. For catastrophe bonds and insurance-linked securities, investor due diligence usually requires real data, limiting synthetic data's role to earlier structuring stages. The practical advantage is removing DPDP Act friction from the preliminary cycle without changing the ultimate placement basis.
What are the typical costs of a synthetic data programme for an Indian general insurer?
A serious synthetic data programme at a mid-sized Indian insurer typically runs INR 3 to 10 crore annually. Platform licences for commercial tools like MOSTLY AI or Gretel range INR 1 to 5 crore. Open-source deployments cost INR 50 lakh to INR 2 crore in initial stand-up and maintenance. Compute costs for synthesis are INR 10 to 50 lakh annually for common use cases. Validation, including statistical testing and privacy verification, adds INR 30 to 80 lakh annually in engineering and actuarial time. The cost compares favourably to the INR 5 to 15 crore annual operating cost of a mature data privacy programme built on real data, and the synthesis approach removes DPDP Act burden from downstream use cases.
Do CSO-actuaries accept synthetic-data-trained models for regulatory reserving and pricing?
Synthetic data supplementation, where synthetic records extend a real dataset to improve training, is broadly accepted by Chief Statutory Actuaries provided the synthesis process is documented and the downstream model is validated on real held-out data. Pure synthetic training, where the model learns entirely from synthetic data, is viewed with more caution and is typically accepted only for prototyping or genuinely data-scarce use cases such as new lines of business and rare perils. Validation documentation should include the synthesis technique and hyperparameters, fidelity test results including downstream task performance, privacy verification results, and an assessment of residual risks. This documentation becomes part of the model validation file supporting actuarial certification.

