
Claims Data Warehouse for Broker Operations in India: 2026 Build Guide

A practitioner's guide to designing, building, and operating a claims data warehouse inside an Indian commercial insurance broker firm, covering source-system extraction, milestone schema, cohort tagging, governance, and the reporting layer that turns raw claim records into renewal and panel decisions.

Sarvada Editorial Team, Insurance Intelligence
claims-data, broker-operations, data-warehouse, analytics, irdai, claims-management

Last reviewed: May 2026

Why Brokers Need a Claims Data Warehouse in 2026

Indian commercial brokers cannot run service-quality benchmarks, renewal analytics, or insurer panel decisions on email threads and spreadsheets. By 2026, mid-market and listed-client risk managers expect their brokers to bring structured claim-level data to every renewal review, with milestone timestamps, surveyor identification, settlement quanta, and cohort tags ready to be sliced. Firms that cannot produce this view lose mandates to firms that can.

The trigger is partly regulatory and partly competitive. The IRDAI master circular on protection of policyholders' interests (revised through 2024) prescribes specific claim-handling timelines that brokers must monitor on behalf of clients. The IRDAI Information and Cybersecurity Guidelines 2023 require structured logging of policy-data access, which presumes that broker firms keep claim records in a managed system rather than mailbox archives. The IRDAI (Insurance Brokers) Regulations, 2018 (as amended through 2024) require brokers to maintain claim files for a minimum of ten years, with auditable access trails.

At the same time, the commercial brokerage market is consolidating. Larger broking firms with structured data operations win mandates from mid-market and listed clients on the strength of their service reporting; smaller firms that depend on relationship and intuition find themselves out-positioned at every renewal cycle. A claims data warehouse is not optional infrastructure; it is the spine on which broker service value is built.

This guide walks through the practical design choices: what to capture, where to source it from, how to model it, how to govern it, and what reporting layer turns it into decisions. It assumes an Indian commercial broker between INR 25 crore and INR 250 crore annual revenue, with 8 to 20 insurer relationships and a claims book of 1,500 to 15,000 claims per year across property, marine, engineering, liability, health, and motor lines.

What the Warehouse Must Capture: Core Entities and Milestones

A claims data warehouse for a commercial broker is built around five core entities: client, policy, claim, claim event, and party. Each is captured with structured fields and timestamps.

The client entity holds the policyholder master record: legal name, CIN where applicable, PAN, registered address, industry classification (NIC code), and relationship-management metadata (account manager, segment, revenue band). One client may hold multiple policies across multiple insurers.

The policy entity holds policy-level data: policy number, insurer, line of business, sum insured, premium, period of insurance, broker code with the insurer, co-insurance arrangement, deductible structure, key wording references, and any layered or quota-share structure. A claim attaches to one policy, though large losses may trigger claims across multiple policies (property plus business interruption, marine cargo plus open cover endorsement).

The claim entity holds claim-level data: claim number with the insurer, broker internal reference, line of business, cause of loss, location of loss, claimed amount, reserve estimate at notification, current reserve, paid-to-date, final settled amount, status (open, partially settled, closed, reopened), and cohort tags (size band, complexity tier, geographic tier).

The claim event entity is where the warehouse earns its keep. Each event in the claim lifecycle is logged as a separate record with a timestamp: date of loss, FNOL date, insurer acknowledgement date, surveyor appointment date, first site visit date, preliminary report submission date, documentation requests, documentation receipts, final surveyor report date, settlement offer date, policyholder acceptance date, payment release date, claim closure date, and any reopening events. The event table is the source of every timeline metric the broker computes.

The party entity holds individuals and firms involved with the claim: surveyor (IRDAI licence number), loss adjuster, legal counsel, technical consultant, TPA where applicable, insurer claims handler, reinsurer where the claim touches a fronted or excess layer. Parties link to events so the warehouse can attribute delays and surface surveyor-level performance.
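The claim, claim event, and party entities can be sketched as plain Python dataclasses. This is a minimal illustration under stated assumptions, not a prescribed schema: every class, field, and event-type name below is hypothetical, and only a subset of the fields described above is shown. Client and policy would follow the same pattern.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

# All names here are illustrative assumptions, not a prescribed schema.

@dataclass
class Party:
    party_id: str
    role: str                          # e.g. "surveyor", "loss_adjuster", "tpa"
    name: str
    licence_no: Optional[str] = None   # IRDAI licence number for surveyors

@dataclass
class ClaimEvent:
    claim_id: str
    event_type: str                    # e.g. "fnol", "surveyor_appointed"
    occurred_at: datetime
    party_id: Optional[str] = None     # links the event to a party

@dataclass
class Claim:
    claim_id: str                      # broker internal reference
    policy_id: str
    insurer_claim_no: str
    status: str = "open"
    events: List[ClaimEvent] = field(default_factory=list)

    def first_event(self, event_type: str) -> Optional[ClaimEvent]:
        """Return the earliest event of the given type, or None."""
        matches = [e for e in self.events if e.event_type == event_type]
        return min(matches, key=lambda e: e.occurred_at, default=None)

    def days_between(self, start_type: str, end_type: str) -> Optional[float]:
        """Elapsed days between two milestones; None if either is missing."""
        start = self.first_event(start_type)
        end = self.first_event(end_type)
        if start is None or end is None:
            return None
        return (end.occurred_at - start.occurred_at).total_seconds() / 86400
```

Keeping events as separate timestamped records, rather than as columns on the claim, is what lets a timeline metric be computed between any two milestones without a schema change.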

Minimum field discipline at FNOL

The single most important capture point is FNOL. If the FNOL record is incomplete, every downstream metric is degraded. Brokers should enforce a minimum FNOL field set: client, policy reference, date of loss, location of loss, cause of loss, preliminary estimate, and reporting channel. Account managers who skip FNOL fields and promise to backfill later create the data debt that destroys benchmark credibility two years on.
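One way to enforce the minimum field set is a completeness check that runs before an FNOL record is accepted. The sketch below assumes FNOL records arrive as dictionaries; the field names are hypothetical labels for the seven fields listed above.

```python
# Field names are illustrative assumptions mapping to the minimum FNOL set.
REQUIRED_FNOL_FIELDS = (
    "client",
    "policy_reference",
    "date_of_loss",
    "location_of_loss",
    "cause_of_loss",
    "preliminary_estimate",
    "reporting_channel",
)

def missing_fnol_fields(record: dict) -> list:
    """Return the required FNOL fields that are absent or blank."""
    missing = []
    for name in REQUIRED_FNOL_FIELDS:
        value = record.get(name)
        # A numeric estimate of 0 counts as present; None and blank strings do not.
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing
```

A case-management form could refuse to save while this list is non-empty, which removes the option of promising to backfill later.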

Where the Data Comes From: Source Systems and Ingestion

Most Indian commercial brokers run on a heterogeneous mix of source systems. A typical 2026 firm has a placement system or CRM (in-house or third-party), an email-driven correspondence flow with insurers, an accounting system for brokerage and premium reconciliation, and a folder-based document store. None of these on its own holds a clean claim dataset.

Five ingestion pipelines feed the warehouse.

  1. Broker-side case management system. The primary source of broker-controlled events: FNOL captured by the account manager, document collection from the client, internal claim handler notes, broker correspondence logs.
  2. Insurer claim portals and APIs. Larger insurers (ICICI Lombard, HDFC ERGO, Bajaj Allianz, Tata AIG, SBI General) expose broker claim portals with structured fields for claim status, surveyor appointment, reserve movement, and settlement decisions. A growing minority offer API access, which the warehouse should consume directly.
  3. Email and document parsing. For insurers without portal access, broker firms still receive surveyor reports, settlement offers, and payment intimations by email. A parsing layer extracts structured fields from these inbound messages and reconciles them against existing claim records.
  4. TPA portals for health claims. Group health claims flow through TPAs (MDIndia, FHPL, MediAssist, Vidal Health, and others). TPA portals expose cashless approval timelines, hospital-side documentation, and reimbursement decisions that must reconcile back to the broker's claim master.
  5. Accounting system. Premium and brokerage flows reconcile against claim payments to ensure that recoveries (deductible recoveries, salvage credits, subrogation) are captured against the originating claim.

The ingestion architecture should be idempotent: re-running an extract should not create duplicate records. Each source system is given a stable claim-key mapping (typically insurer claim number plus insurer code) so that updates flow into the same warehouse record on each refresh.
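A minimal sketch of idempotent loading, assuming an in-memory dictionary stands in for the warehouse table and that each extracted row carries hypothetical `insurer_code` and `insurer_claim_no` fields. In a real warehouse the same idea is usually expressed as a merge or upsert on the claim key.

```python
def claim_key(insurer_code: str, insurer_claim_no: str) -> tuple:
    """Stable key: insurer code plus normalised insurer claim number."""
    return (insurer_code.strip().upper(), insurer_claim_no.strip().upper())

def upsert_claims(store: dict, extract: list) -> dict:
    """Merge an extract into the store; re-running the same extract changes nothing."""
    for row in extract:
        key = claim_key(row["insurer_code"], row["insurer_claim_no"])
        existing = store.get(key, {})
        # Update the existing record in place rather than appending a new one.
        store[key] = {**existing, **row}
    return store
```

Normalising case and whitespace in the key matters because the same claim number often arrives formatted differently from a portal, an email, and the accounting system.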

For smaller firms, a daily batch refresh from each source is adequate. For larger firms running near-real-time client dashboards, change-data-capture on the case-management system feeding an event-streaming pipeline, or at minimum hourly incremental loads, is more appropriate, though the operational complexity is meaningful. Most Indian broker firms below INR 100 crore revenue should start with daily batch and add streaming only when client-facing requirements demand it.

Email parsing realism

Email parsing accuracy in the Indian market hovers around 70 to 85 percent for well-templated insurer correspondence and drops sharply for free-form messages. A practical approach is to use the parsed fields as a draft, route exceptions to a human reviewer, and treat the human-confirmed record as authoritative. Generative-AI-based extraction has improved through 2025 to 2026 but still requires human-in-the-loop review for material claims.
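The draft-then-review pattern can be sketched with two regular expressions. The patterns below are hypothetical templates invented for illustration; real insurer correspondence will need its own patterns, and the output status names are assumptions.

```python
import re

# Hypothetical template patterns; real insurer correspondence will differ.
PATTERNS = {
    "claim_no": re.compile(
        r"Claim\s*(?:No\.?|Number)\s*[:\-]\s*([A-Z0-9/\-]+)", re.I),
    "settlement_amount": re.compile(
        r"Settlement\s*Amount\s*[:\-]\s*(?:INR|Rs\.?)\s*([\d,]+)", re.I),
}

def parse_email(body: str) -> dict:
    """Extract draft fields and flag the message for human review if any are missing."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(body)
        fields[name] = match.group(1) if match else None
    if fields["settlement_amount"] is not None:
        # Strip Indian-style digit grouping, e.g. "12,50,000" becomes 1250000.
        fields["settlement_amount"] = int(fields["settlement_amount"].replace(",", ""))
    needs_review = any(value is None for value in fields.values())
    # Parsed output is always a draft; only a human-confirmed record is authoritative.
    return {"fields": fields, "needs_review": needs_review, "status": "draft"}
```

Anything that does not match a template is routed to a reviewer rather than guessed at, which keeps extraction errors out of the claim master.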

Data Model Choices: Star Schema, Slowly Changing Dimensions, and Cohort Tags

The warehouse should be modelled as a star schema with a central fact table (claim events) and dimension tables (client, policy, insurer, surveyor, geography, time, cohort).

The fact table holds one row per event, with foreign keys to dimensions and the event timestamp. Aggregations (median FNOL-to-appointment time by insurer by quarter) are computed off the fact table with standard SQL or a BI tool. The fact table grows linearly with claim volume; a firm with 5,000 claims annually and 20 events per claim generates 100,000 event rows per year, which is trivial for any modern warehouse.
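As an illustration of the kind of aggregation described above, the sketch below computes median FNOL-to-appointment days by insurer and quarter from event rows. The row layout and event-type names are assumptions, and each claim is assigned to the quarter of its FNOL date, which is one of several defensible conventions.

```python
from collections import defaultdict
from datetime import date
from statistics import median

def quarter_of(d: date) -> str:
    """Calendar quarter label such as '2026-Q1'."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"

def median_fnol_to_appointment(rows: list) -> dict:
    """Median days from FNOL to surveyor appointment, keyed by (insurer, quarter).

    Each row is one fact-table event with keys:
    claim_id, insurer, event_type, event_date.
    """
    milestones = defaultdict(dict)
    for r in rows:
        milestones[(r["claim_id"], r["insurer"])][r["event_type"]] = r["event_date"]

    buckets = defaultdict(list)
    for (_, insurer), events in milestones.items():
        # Claims missing either milestone are excluded rather than imputed.
        if "fnol" in events and "surveyor_appointed" in events:
            days = (events["surveyor_appointed"] - events["fnol"]).days
            buckets[(insurer, quarter_of(events["fnol"]))].append(days)
    return {key: median(values) for key, values in buckets.items()}
```

The median is used instead of the mean so that a handful of very slow claims does not dominate an insurer's figure.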

The policy and client dimensions are slowly changing. A client may change its industry classification mid-year, a policy may be endorsed to add a location, an insurer may change its broker code mapping. Slowly Changing Dimension Type 2 (SCD2) is the standard approach: each change creates a new dimension row with effective-from and effective-to dates, and fact rows link to the dimension version that was active at the event timestamp. SCD2 adds operational complexity but is essential when the warehouse is used for multi-year trend analysis.
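The SCD2 mechanics can be sketched in a few lines, assuming dimension versions are held as a list of dictionaries. The function names, the `key` field, and the far-future sentinel date are illustrative choices; in practice this logic usually lives in the transformation layer.

```python
from datetime import date
from typing import Optional

OPEN_END = date(9999, 12, 31)   # conventional marker for the current version

def apply_change(versions: list, key: str, attrs: dict, effective: date) -> None:
    """Close the current version for `key` and append a new one (SCD Type 2)."""
    for v in versions:
        if v["key"] == key and v["effective_to"] == OPEN_END:
            v["effective_to"] = effective
    versions.append({"key": key, **attrs,
                     "effective_from": effective, "effective_to": OPEN_END})

def version_at(versions: list, key: str, as_of: date) -> Optional[dict]:
    """Return the version active on `as_of` (from inclusive, to exclusive)."""
    for v in versions:
        if v["key"] == key and v["effective_from"] <= as_of < v["effective_to"]:
            return v
    return None
```

A fact row would call the equivalent of `version_at` with its event timestamp, so a claim notified before a client's reclassification still reports under the old industry code.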

The insurer dimension is small (20 to 30 insurers in the Indian non-life market) but consequential. It should hold insurer name, ownership type (PSU, private Indian, private foreign-JV, GIC), claims operations centre location, broker relationship manager, current panel status (active, restricted, suspended), and any line-specific notes.

The surveyor dimension is larger (several thousand IRDAI-licensed surveyors active in the Indian market) and is the basis for surveyor-level performance analytics. It should hold surveyor licence number, name, firm, licensed lines, primary geography, and an active-relationship flag.

The cohort dimension is the broker's editorial overlay. It holds cohort tags: claim size band, complexity tier (straight, moderate, complex), geographic tier (metro, Tier-1 cluster, Tier-2 cluster, remote), and any custom segments the broker firm uses for its own analytics. Cohort tagging is partly automated (size band from claim amount) and partly judgmental (complexity tier requires human classification).

Cohort tagging discipline

The cohort tags are where most data warehouses degrade over time. The temptation to skip the complexity tier on routine claims, or to leave geographic tier defaulted to metro, accumulates into a tagging dataset too inconsistent to support cohort-adjusted benchmarking. The fix is to make the tags mandatory at claim closure rather than at claim opening, with a senior claims-handler review of the tag before the claim is closed. This shifts the tagging effort to a point in the workflow where the claim handler has full information, and adds a quality-control checkpoint.
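A sketch of both halves of the tagging workflow: the size band derived automatically from claim amount, and a closure gate that blocks closing until the judgmental tags and the senior review are present. The band thresholds are hypothetical and each firm should set its own; the tag values and the `tag_reviewed_by` field name are likewise assumptions.

```python
# Hypothetical band edges in INR; each firm should set its own.
SIZE_BANDS = [
    (500_000, "small"),       # below INR 5 lakh
    (5_000_000, "medium"),    # below INR 50 lakh
    (50_000_000, "large"),    # below INR 5 crore
]

VALID_COMPLEXITY = {"straight", "moderate", "complex"}
VALID_GEOGRAPHY = {"metro", "tier1_cluster", "tier2_cluster", "remote"}

def size_band(claim_amount_inr: float) -> str:
    """Automated tag: derive the size band from the claim amount."""
    for upper, label in SIZE_BANDS:
        if claim_amount_inr < upper:
            return label
    return "major"

def closure_blockers(claim: dict) -> list:
    """Return the tagging items that still block closure of this claim."""
    blockers = []
    if claim.get("complexity_tier") not in VALID_COMPLEXITY:
        blockers.append("complexity_tier")
    if claim.get("geographic_tier") not in VALID_GEOGRAPHY:
        blockers.append("geographic_tier")
    if not claim.get("tag_reviewed_by"):
        blockers.append("tag_reviewed_by")   # senior handler sign-off missing
    return blockers
```

A check like this cannot tell a considered "metro" tag from a lazy default, which is why the senior review remains part of the gate.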

Governance, Access Controls, and Regulatory Alignment

A claims warehouse holds policyholder personal data, claim financials, and insurer-confidential information. Governance is not optional.

The Digital Personal Data Protection Act, 2023 (DPDP Act) treats claim data containing individual policyholder information as personal data, with implications for consent, data minimisation, and breach notification. For commercial claims involving individual claimants (employee injury under workers' compensation, third-party motor claimants, key-person directors' and officers' claims), the broker is a data fiduciary or processor depending on contractual structure with the insurer. The warehouse design should support data-subject requests (access, correction, erasure where permissible), retention policy enforcement, and breach-incident reporting.

The IRDAI Information and Cybersecurity Guidelines 2023 require structured access logging, role-based access controls, encryption of personal data at rest and in transit, and incident reporting timelines. The warehouse must implement these controls explicitly. Access logs should be immutable and retained for at least three years.

Access control inside the broker firm follows the principle of least privilege. Account managers see clients they handle; claims handlers see claims they work on; firm leadership sees aggregated views with on-demand drill-down protected by audit logging. Cross-client analytics (insurer benchmarks, surveyor scorecards) should be exposed only to firm-level roles, not to client-facing staff who may inadvertently leak competitive data.
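The least-privilege rules above can be sketched as a row-level filter with a default-deny fallback. The role names, user fields, and audit-log shape are assumptions for illustration; a production warehouse would normally enforce this with the platform's own row-level security rather than in application code.

```python
FIRM_LEVEL_ROLES = {"firm_leadership", "data_steward"}   # assumed role names

def visible_claims(user: dict, claims: list, audit_log: list) -> list:
    """Return the claim rows this user may see, logging firm-level drill-downs."""
    role = user["role"]
    if role == "account_manager":
        allowed = [c for c in claims if c["client_id"] in user.get("client_ids", ())]
    elif role == "claims_handler":
        allowed = [c for c in claims if c["claim_id"] in user.get("claim_ids", ())]
    elif role in FIRM_LEVEL_ROLES:
        allowed = list(claims)
        # Drill-down by firm-level roles is permitted but always audited.
        audit_log.append((user["user_id"], "drill_down", len(allowed)))
    else:
        allowed = []   # default deny for unknown roles
    return allowed

def can_view_cross_client_analytics(user: dict) -> bool:
    """Insurer benchmarks and surveyor scorecards are firm-level only."""
    return user["role"] in FIRM_LEVEL_ROLES
```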

The IRDAI (Insurance Brokers) Regulations, 2018 require claim files to be retained for ten years from claim closure. The warehouse retention policy must align: events older than ten years can be archived to cold storage with a documented restore procedure, but should not be deleted while the underlying obligation persists.
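A sketch of how the ten-year rule might be encoded, assuming retention is counted from the claim closure date. The function names and the two storage labels are hypothetical. Note that the policy never returns a delete action, in line with the archive-only approach described above.

```python
from datetime import date
from typing import Optional

RETENTION_YEARS = 10

def retention_end(closure_date: date) -> date:
    """Date on which the ten-year period counted from closure ends."""
    target_year = closure_date.year + RETENTION_YEARS
    try:
        return closure_date.replace(year=target_year)
    except ValueError:
        # Closure on 29 February landing in a non-leap year.
        return closure_date.replace(year=target_year, day=28)

def storage_action(closure_date: Optional[date], today: date) -> str:
    """Decide where a claim's records live; deletion is never returned."""
    if closure_date is None:
        return "keep_hot"              # claim still open
    if today >= retention_end(closure_date):
        return "archive_cold"          # archive with a documented restore path
    return "keep_hot"
```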

External data sharing

Brokers occasionally share aggregated benchmark data with insurers and selected clients. This sharing must be governed by a written data-sharing protocol covering aggregation thresholds (no insurer-level data revealing fewer than 10 underlying claims, no client-level data at all in cross-firm exchanges), anonymisation standards, and approval workflows. Informal sharing of insurer scorecards in WhatsApp groups or unprotected email is a common source of relationship damage and, depending on content, a DPDP Act risk.
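The aggregation threshold lends itself to a simple automated check before any benchmark leaves the firm. The sketch below assumes each benchmark row carries a `claim_count` and a `median_days` value; both field names are illustrative.

```python
MIN_CLAIMS = 10   # threshold from the data-sharing protocol

def suppress_small_cells(benchmarks: list, min_claims: int = MIN_CLAIMS) -> list:
    """Blank out metrics for any row backed by fewer than `min_claims` claims."""
    released = []
    for row in benchmarks:
        if row["claim_count"] < min_claims:
            released.append({**row, "median_days": None, "suppressed": True})
        else:
            released.append({**row, "suppressed": False})
    return released
```

The function returns new rows rather than editing the input, so the unsuppressed figures remain available for internal use.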

The Reporting Layer: From Warehouse to Decisions

A warehouse without a reporting layer is shelfware. The reporting layer translates structured data into the recurring decisions that broker firms make: panel management, renewal pricing, client reviews, insurer negotiations, and capacity planning.

Five standard reports should be produced on a fixed cadence.

  1. Insurer service scorecard. Quarterly, by line of business and cohort. Shows the four timeline metrics (FNOL to appointment, appointment to preliminary, preliminary to final, final to payment) for each insurer, with quartile placement and trend versus the prior quarter. This is the spine of panel-management decisions.
  2. Client claims review. Per client, per renewal cycle (typically annual). Shows the client's claim experience over the prior 24 months: count, severity, root cause analysis, insurer responsiveness, and outstanding reserves. Forms the evidence base for the renewal conversation.
  3. Surveyor performance report. Quarterly, by line and geography. Shows median time-to-preliminary, time-to-final, and rework rate for each active surveyor. Used to refine the broker's preferred-surveyor recommendation list and to flag surveyors for performance discussions.
  4. Loss-cause analytics. Quarterly, aggregated across the broker book. Shows the leading causes of loss by line and industry, supporting client-facing risk-engineering recommendations and underwriting submission narratives.
  5. Capacity-and-pricing dashboard. Monthly, for firm leadership. Shows premium written by insurer by line, loss ratios on the broker book, and pricing trends, supporting strategic conversations about insurer relationships and competitive positioning.

Reports should be parameterised. The same insurer scorecard template should run for a single client (showing how insurers performed on that client's claims) or for the full broker book (showing market-wide insurer performance). Parameterisation lets a single report template serve account managers, claims handlers, and firm leadership without duplicating effort.
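A minimal sketch of a parameterised scorecard, computing the four timeline metrics per insurer for either one client or the whole book. The milestone names and the input layout (one dictionary per claim with a `milestones` mapping) are assumptions; quartile placement and trend are omitted for brevity.

```python
from datetime import date
from statistics import median
from typing import Dict, Optional

# Assumed milestone names for the four timeline metrics.
STAGES = [
    ("fnol_to_appointment", "fnol", "surveyor_appointed"),
    ("appointment_to_preliminary", "surveyor_appointed", "preliminary_report"),
    ("preliminary_to_final", "preliminary_report", "final_report"),
    ("final_to_payment", "final_report", "payment_released"),
]

def stage_days(milestones: Dict[str, date], start: str, end: str) -> Optional[int]:
    """Days between two milestones, or None if either is missing."""
    if start in milestones and end in milestones:
        return (milestones[end] - milestones[start]).days
    return None

def insurer_scorecard(claims: list, client_id: Optional[str] = None) -> dict:
    """Median days per stage per insurer; pass client_id to restrict to one client."""
    if client_id is not None:
        claims = [c for c in claims if c["client_id"] == client_id]
    durations: dict = {}
    for c in claims:
        for name, start, end in STAGES:
            days = stage_days(c["milestones"], start, end)
            if days is not None:
                durations.setdefault(c["insurer"], {}).setdefault(name, []).append(days)
    return {insurer: {name: median(values) for name, values in stages.items()}
            for insurer, stages in durations.items()}
```

Calling `insurer_scorecard(claims)` gives the book-wide view, while `insurer_scorecard(claims, client_id="X")` gives the same template restricted to one client's claims.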

Self-service versus curated

Larger broker firms inevitably face requests for self-service analytics: 'give me a tool where I can slice claims by anything.' The honest answer is that self-service is hard to do well with claim data, because the cohort definitions and complexity tags require domain context that a generic BI tool does not enforce. A workable middle path is to publish a small set of curated reports as the canonical source of truth, and to allow self-service exploration in a sandbox environment that is clearly labelled as not for client-facing use. This protects the firm against contradictory numbers showing up in different client meetings.

Build, Buy, or Rent: Technology Choices for 2026

The technology stack for a claims data warehouse in 2026 has converged onto a small set of viable patterns.

For small to mid-size brokers (annual revenue INR 25 crore to INR 100 crore), the practical choice is a managed cloud data warehouse (Snowflake, Google BigQuery, Amazon Redshift, or Azure Synapse) paired with a managed ingestion tool (Fivetran, Airbyte, or a low-code orchestrator) and a BI layer (Looker, Power BI, Tableau, or Metabase). Total infrastructure cost runs INR 15 lakh to INR 60 lakh per year depending on volume. Implementation timeline is 4 to 9 months for a competent in-house data team or a specialist consultant engagement.

For larger firms (INR 100 crore plus revenue), the same stack scales, but the build investment increases and most firms add a customer data platform layer for client-facing dashboards. Total annual run cost can reach INR 1 crore to INR 3 crore including platform fees, integrations, and dedicated data engineering staff.

A growing alternative is to subscribe to a broker-focused SaaS platform that includes claims case management, document workflow, and reporting in a single product. Several Indian insurtech vendors and at least two international platforms now serve the Indian commercial broker segment. The subscription model removes the build complexity but constrains the firm to the vendor's data model and reporting choices; cohort tagging, custom segmentation, and insurer-specific scorecards may be limited or require additional work.

The build-versus-buy decision should be guided by three questions. First, how differentiated is the firm's claims-handling proposition? If claims service is a core differentiator, a custom warehouse with tailored reporting is justified. Second, how strong is the firm's data engineering capability? Firms without in-house engineering should buy or partner rather than build. Third, what is the timeline? A buy can be live in 8 to 12 weeks, while a build typically takes 6 to 12 months.

A realistic 2026 stack

For a typical mid-market broker building in 2026, a defensible stack is: Snowflake or BigQuery as the warehouse, Fivetran or Airbyte for managed extracts where source connectors exist, a custom Python or Node pipeline for email parsing and insurer-portal scraping, dbt for transformation and modelling, Looker or Power BI for the reporting layer, and a small data engineering team of 2 to 4 people supporting the entire stack. Cloud spend lands around INR 25 lakh to INR 50 lakh annually for a 5,000-claim book.

Operating the Warehouse: People, Cadence, and Quality Control

The warehouse needs operators, not just engineers. Three roles staff the operating model.

A data steward (typically a senior operations or claims-leadership person) owns the data definitions, cohort tagging standards, and quarterly benchmark methodology. This role makes judgment calls about classification, handles disputes with insurers about benchmark figures, and signs off on quarterly reports before they are distributed.

A data engineering function (in-house or vendor) owns the ingestion pipelines, transformation logic, and infrastructure reliability. This function fixes broken extracts, monitors data freshness, and implements schema changes. Even at smaller firms, a half-time data engineer is the minimum sustainable staffing.

A claims operations team owns the input quality. Account managers and claims handlers update the case-management system as events occur, attach documents to claim records, and tag claims at closure. Without this discipline, every downstream report degrades. Firms should treat claims-tracker hygiene as a measurable KPI for the team, with monthly reviews of completeness, timeliness, and field-level error rates.

Quality control runs on a quarterly cycle. Each quarter, the data steward should sample 50 to 100 claims at random and reconcile the warehouse record against the underlying email trail, insurer portal status, and accounting record. The reconciliation surfaces field-level errors, missing events, and cohort tag inconsistencies. A reconciliation report goes to firm leadership with the metrics: percentage of sampled claims fully accurate, percentage with minor errors, percentage with material errors. Targets should improve quarter over quarter as the discipline matures; a healthy warehouse achieves 95 percent or better full accuracy within two years of go-live.
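The quarterly sample and the three headline percentages can be sketched as follows. The outcome labels (`accurate`, `minor`, `material`) and function names are assumptions; the reconciliation itself remains a manual comparison against the email trail, portal status, and accounting record.

```python
import random
from collections import Counter
from typing import Optional

OUTCOMES = ("accurate", "minor", "material")

def sample_claims(claim_ids, sample_size: int = 50, seed: Optional[int] = None) -> list:
    """Draw a random sample of claims without replacement."""
    ids = list(claim_ids)
    rng = random.Random(seed)   # a fixed seed makes the draw reproducible for audit
    return rng.sample(ids, min(sample_size, len(ids)))

def reconciliation_summary(results: dict) -> dict:
    """Percentage of sampled claims per outcome; results maps claim_id to outcome."""
    if not results:
        raise ValueError("no reconciliation results supplied")
    counts = Counter(results.values())
    total = len(results)
    return {label: round(100 * counts[label] / total, 1) for label in OUTCOMES}
```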

When the warehouse fails

Warehouses fail for predictable reasons: an insurer changes a portal layout and parsing breaks, an account manager leaves and the handover of their claims is incomplete, a new insurer relationship is onboarded but its claim numbering convention is not mapped, or a refresh job silently fails for a week. The failure modes are not exotic; what matters is whether the operating model surfaces them quickly. Two practical instruments help. First, freshness monitoring: dashboards should flag any source whose latest event is older than 48 hours. Second, count-based anomaly detection: if the weekly count of new FNOL events drops by more than 30 percent versus the trailing 12-week average, an alert routes to the data steward for investigation. These are simple controls, but they catch the majority of operational degradations before they reach client-facing reports.
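Both controls are a few lines of code each. The sketch below assumes the latest event timestamp per source and a list of weekly FNOL counts are already available; the function and parameter names are illustrative.

```python
from datetime import datetime, timedelta

def stale_sources(latest_event_by_source: dict, now: datetime,
                  max_age_hours: int = 48) -> list:
    """Sources whose most recent event is older than the freshness limit."""
    cutoff = now - timedelta(hours=max_age_hours)
    return sorted(src for src, ts in latest_event_by_source.items() if ts < cutoff)

def fnol_drop_alert(weekly_counts: list, threshold: float = 0.30,
                    window: int = 12) -> bool:
    """True if the latest week fell more than `threshold` below the trailing average.

    weekly_counts is ordered oldest first; the last item is the current week.
    """
    if len(weekly_counts) < window + 1:
        return False                      # not enough history to judge
    baseline = sum(weekly_counts[-(window + 1):-1]) / window
    if baseline == 0:
        return False
    drop = (baseline - weekly_counts[-1]) / baseline
    return drop > threshold
```

The trailing average deliberately excludes the current week, so a sudden drop is compared against an uncontaminated baseline.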

Frequently Asked Questions

What is the minimum data volume that justifies building a claims data warehouse?
A broker firm handling fewer than 500 claims annually across 4 to 6 insurers can usually operate from a well-maintained spreadsheet and a structured case-management system without a warehouse. Above 1,000 claims per year with 8 or more insurer relationships, spreadsheet maintenance breaks down and a warehouse becomes the lower-effort option. The decision is not only volume but analytics ambition: a firm that wants to publish quarterly insurer benchmarks, client renewal analytics, or surveyor scorecards needs a warehouse regardless of size, because spreadsheets cannot reliably produce these views with cohort segmentation.
How long does it take to build a usable claims data warehouse?
For a mid-market Indian broker with a competent vendor or in-house data team, a minimum viable warehouse with one source (case management), basic claim and event schema, and three reports can go live in 8 to 12 weeks. A more complete build covering multiple insurer portals, email parsing, TPA reconciliation, and the full reporting suite takes 6 to 12 months. Most firms underestimate the operational change-management effort: persuading account managers and claims handlers to update the case-management system as events occur is harder than building the warehouse itself, and the firm should plan dedicated training and KPI work for at least the first year.
How should brokers handle insurer-confidential data such as surveyor scorecards in the warehouse?
Surveyor and insurer scorecards are competitively sensitive and should be treated as confidential operational information. Access should be restricted to firm leadership and the data-steward role, with audit logging on every view. Aggregated insurer benchmarks shared with clients should suppress data points based on fewer than 10 underlying claims, so that an individual insurer or surveyor cannot be identified from thin data. Sharing with insurers themselves (for example, returning their own performance data) should follow a documented protocol with named approvers. Casual circulation in WhatsApp groups or unprotected email is a common source of relationship damage and a DPDP Act risk where personal data is involved.
How does the DPDP Act affect claims warehouse design for commercial brokers?
The Digital Personal Data Protection Act treats claim data containing individual policyholder information as personal data. For commercial claims, the broker is typically a data processor acting on the insurer's instructions, though the contractual structure should be confirmed in each case. Warehouse design must support data-subject requests (access, correction, erasure where permissible), retention policy enforcement aligned with the IRDAI ten-year minimum, encryption at rest and in transit, role-based access control, and immutable access logging. Brokers should document a data protection impact assessment for the warehouse build and refresh it when material changes are made, and they should review their insurer data-processing agreements to confirm that warehouse-based processing is contemplated.
Should brokers use a SaaS broker platform or build a custom warehouse?
The choice depends on differentiation, capability, and timeline. If claims service is a core differentiator and the firm has data engineering capability, a custom warehouse with tailored cohort tagging and bespoke insurer scorecards is justified and produces analytics that vendor platforms cannot match. If the firm wants to be live quickly with a working operating model, lacks data engineering depth, and can accept a standardised reporting set, a SaaS broker platform reduces the build risk and shortens the timeline to 8 to 12 weeks. Several Indian and international vendors now serve the commercial broker segment, and the right choice is the one that aligns with the firm's strategic positioning rather than the lowest sticker price.
