The Indian SME Underwriting Data Problem
India has an estimated 63 million micro, small, and medium enterprises as defined under the MSMED Act 2006, contributing roughly 30% of GDP and employing over 110 million people. This sector is chronically underinsured. The MSME Insurance Council estimated in 2025 that fewer than 11% of registered MSMEs hold any commercial insurance beyond motor third-party liability, and fewer than 4% have fire or property coverage. The primary reason cited by insurers for low penetration is not price or distribution, but underwriting difficulty: the data required to assess SME risk simply does not exist in the structured, audited form that actuarial underwriting models assume.
A mid-size manufacturing firm with INR 15 crore in annual turnover applying for a fire-cum-machinery breakdown policy presents a fundamental underwriting challenge. If it is a partnership firm or a proprietorship, it is not required to file audited financials with any public authority. If it is a private limited company, its Registrar of Companies filing may be 12 to 18 months old and may be prepared to minimise tax liability rather than accurately represent the business's financial health. Its credit history with banks may be limited if it relies primarily on trade credit from suppliers. It may have no formal occupational health and safety records. The proposal form that an underwriter receives contains self-reported information across 30 to 40 fields, with limited verifiable cross-references.
Traditional actuarial approaches to SME underwriting cope with this data scarcity in two ways: they use broad risk classifications (hazard category, construction type, location) that can be assessed without the firm-level data that is unavailable, and they charge loading premiums that compensate for uncertainty. This approach produces premiums that are adequate in aggregate but mispriced at the individual firm level: safe, well-managed SMEs pay the same rates as poorly managed ones in the same risk category, and the safe firms either pay too much (if the loading is dominated by the risky firms) or go uninsured. Neither outcome helps the Indian commercial insurance market develop.
Large language models offer a different approach. They can read, interpret, and synthesise information from unstructured and semi-structured sources that traditional actuarial models cannot process: GST return narratives, MCA filings, online court records, business news, environmental compliance records from the Central Pollution Control Board, business credit reports from agencies such as CIBIL's commercial bureau, Equifax Commercial, and Experian India. By extracting signals from these sources and combining them with the available structured data, LLMs can construct a risk profile for an SME that is significantly richer than what is available from the proposal form alone.
What LLMs Can Extract from GST Returns, MCA Filings, News, and Court Records
The signal value of each data source for SME underwriting differs in type and quality. Understanding what each source can and cannot contribute is essential to calibrating the LLM's output appropriately.
GST returns (Form GSTR-1, GSTR-3B, GSTR-9) are available for all GST-registered businesses through the GST portal, and with the business's consent, an insurer or its authorised analytics partner can access the filing history through the AA (Account Aggregator) framework or through consent-based API access. GST return data reveals several insurance-relevant signals: revenue trend over the past 24 to 36 months (which the LLM can extract from the monthly or quarterly GSTR-3B filings and reconcile against the annual GSTR-9 return), the proportion of sales to identifiable customers versus anonymous buyers (a proxy for business stability), the geographic distribution of supply (a proxy for logistics exposure), and whether filing has been consistent or irregular (irregular filing is associated with financial stress). An LLM applied to 24 months of GST returns can produce a revenue trend narrative, flag material year-on-year declines, and identify seasonal patterns that indicate the nature of the business's operations.
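The trend and filing-regularity signals described above can be sketched in a few lines. This is a minimal illustration, assuming the monthly turnover has already been extracted from the returns into a mapping keyed by "YYYY-MM" period; the function names and the data layout are hypothetical and are not part of any GST portal or Account Aggregator API.

```python
def month_range(start: str, end: str) -> list[str]:
    """All "YYYY-MM" periods from start to end, inclusive."""
    year, month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    periods = []
    while (year, month) <= (end_year, end_month):
        periods.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return periods


def missing_periods(filed: set[str], start: str, end: str) -> list[str]:
    """Periods in the window for which no return was filed."""
    return [p for p in month_range(start, end) if p not in filed]


def yoy_change(turnover: dict[str, float]) -> float:
    """Change in turnover: latest 12 periods versus the 12 before them.

    Assumes the last 24 keys are contiguous months; run missing_periods
    first to confirm there are no gaps.
    """
    periods = sorted(turnover)  # "YYYY-MM" strings sort chronologically
    if len(periods) < 24:
        raise ValueError("need at least 24 monthly periods")
    latest = sum(turnover[p] for p in periods[-12:])
    prior = sum(turnover[p] for p in periods[-24:-12])
    if prior == 0:
        raise ValueError("prior-year turnover is zero")
    return (latest - prior) / prior


# Hypothetical firm: turnover of 100 per month, then 80 per month.
periods = month_range("2023-04", "2025-03")
turnover = {p: (100.0 if i < 12 else 80.0) for i, p in enumerate(periods)}
change = yoy_change(turnover)
gaps = missing_periods(set(periods) - {"2024-07"}, "2023-04", "2025-03")
```

In this made-up example the turnover falls by 20% year on year and one filing period is missing. In a real pipeline the LLM would supply the extracted figures and the deterministic arithmetic would stay outside the model, so that the trend can be recomputed and audited.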
MCA21 filings from the Ministry of Corporate Affairs are publicly available for all companies registered under the Companies Act 2013. For private limited companies, MCA21 filings include the annual return (MGT-7), the balance sheet and profit and loss account (AOC-4), the directors' report, and any charges created on assets. An LLM can read these documents, extract financial ratios (debt-to-equity, current ratio, return on assets) from the balance sheet, identify pledged or mortgaged assets that affect the insurable interest calculation, flag director changes that might indicate management instability, and identify whether the company has received any regulatory notices. The limitation is timeliness: MCA21 filings are typically 12 to 18 months old by the time they are available, which limits their value for fast-changing businesses.
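As a hedged sketch of the ratio step, assuming the LLM has already extracted the relevant balance sheet and profit and loss figures from the AOC-4 filing into a simple record (the class and field names below are illustrative, not an MCA21 schema):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BalanceSheetExtract:
    """Figures extracted from one filing, all in the same unit (e.g. INR crore)."""
    total_debt: float
    shareholders_equity: float
    current_assets: float
    current_liabilities: float
    net_profit: float
    total_assets: float


def _safe_div(numerator: float, denominator: float) -> Optional[float]:
    # Return None rather than raising, so a zero denominator becomes an
    # explicit "not computable" value for the underwriter to see.
    return numerator / denominator if denominator else None


def financial_ratios(bs: BalanceSheetExtract) -> dict[str, Optional[float]]:
    return {
        "debt_to_equity": _safe_div(bs.total_debt, bs.shareholders_equity),
        "current_ratio": _safe_div(bs.current_assets, bs.current_liabilities),
        "return_on_assets": _safe_div(bs.net_profit, bs.total_assets),
    }


# Hypothetical figures for illustration only.
example = BalanceSheetExtract(6.0, 4.0, 5.0, 4.0, 1.0, 10.0)
ratios = financial_ratios(example)
```

Computing the ratios in code from extracted line items, rather than asking the model to state the ratios directly, keeps the arithmetic verifiable against the source document.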
Court record databases (e-Courts, NCLT, NCLAT, district court portals) contain information about litigation involving the business: civil suits, consumer complaints, labour disputes, environmental enforcement actions, and insolvency applications. For underwriting purposes, pending litigation is relevant because it affects the insured's financial position (a large judgment against the firm could reduce its ability to maintain the property in good condition), indicates the nature of disputes the business is involved in (a consumer company with frequent product liability suits is a different risk than one without), and can signal management quality. An LLM can scan court records for the company name and its directors, extract the nature and status of pending cases, and flag material litigation for underwriter attention.
News and web sources provide real-time signals not available in regulatory filings. A factory fire at an SME premises that was reported in a local newspaper but not in any regulatory filing can be surfaced by an LLM that monitors news archives. A labour strike, an environmental violation reported by an activist group, a recall of products mentioned in a trade publication, or an adverse finding in a GST audit that leaked to the press can all be material to underwriting and are available in unstructured text but not in structured databases. The challenge with news sources is noise: an LLM must distinguish between material adverse information and irrelevant mentions, and must handle ambiguous entity resolution (multiple businesses with similar names operating in the same geographic area).
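The entity resolution problem can be illustrated with a simple pre-filter that runs before any news mention is shown to the LLM or the underwriter. This is a sketch under stated assumptions: the suffix list, the 0.9 similarity threshold, and the rule that the city must match exactly are all illustrative choices, and the firm names are invented.

```python
import re
from difflib import SequenceMatcher

# Legal-form words that carry no identifying information.
LEGAL_SUFFIXES = r"\b(private|pvt|limited|ltd|llp|co|company)\b"


def normalise(name: str) -> str:
    """Lowercase, strip punctuation and legal-form suffixes, collapse spaces."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    name = re.sub(LEGAL_SUFFIXES, " ", name)
    return " ".join(name.split())


def is_probable_match(insured_name: str, insured_city: str,
                      mention_name: str, mention_city: str,
                      threshold: float = 0.9) -> bool:
    """True only if the city matches and the normalised names are similar."""
    if insured_city.strip().lower() != mention_city.strip().lower():
        return False
    similarity = SequenceMatcher(
        None, normalise(insured_name), normalise(mention_name)
    ).ratio()
    return similarity >= threshold


same_firm = is_probable_match(
    "Shree Ganesh Textiles Pvt. Ltd.", "Surat",
    "Shree Ganesh Textiles Private Limited", "Surat",
)
other_city = is_probable_match(
    "Shree Ganesh Textiles Pvt. Ltd.", "Surat",
    "Shree Ganesh Textiles Private Limited", "Ludhiana",
)
other_firm = is_probable_match(
    "Shree Ganesh Textiles Pvt. Ltd.", "Surat",
    "Ganesh Steel Works", "Surat",
)
```

A name-and-city filter of this kind will still confuse two similarly named firms in the same city, so a production system would want a stronger identifier where the source provides one, and would route ambiguous matches to the underwriter as open questions.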
Synthesis and the Role of the LLM
The LLM's distinct contribution is synthesis. Traditional rule-based systems can extract individual data points from these sources. The LLM can read across multiple sources, resolve apparent contradictions (the company's MCA filing shows healthy revenue but its GST filing shows a 40% decline in the following year), construct a narrative of the business's trajectory, and produce a risk assessment that is grounded in specific cited evidence rather than a black-box score. This narrative output is genuinely useful to the underwriter: it can be reviewed, challenged, and corrected in a way that a score cannot.
Accuracy vs. Traditional Actuarial Approaches
The question of whether LLM-based SME risk assessment is more accurate than traditional actuarial approaches does not have a simple answer, because accuracy depends on the prediction target, the quality of training data, and the baseline comparison. In the Indian SME context, the honest answer in 2026 is that the evidence is promising but limited, with most validation data coming from closed sandbox environments rather than from large-scale production deployments.
Traditional SME underwriting in India operates primarily on risk classification parameters that can be observed without firm-specific data: the hazard class of the manufacturing process (as defined in Schedule I and II of the Factories Act 1948), the construction type of the building (pucca, semi-pucca, kutcha, or fire-resistant construction), the sum insured, and the geographic location for natural peril exposure. These parameters are genuinely predictive of claims frequency at the portfolio level, but their predictive power at the individual firm level is limited: a well-managed firm in a high-hazard class can have excellent loss experience, while a poorly managed firm in a low-hazard class can have frequent claims.
LLM-based approaches introduce firm-level information that traditional models lack. The revenue trend from GST returns is associated with financial stability, which is associated with maintenance quality and claims frequency. Pending litigation from court records has predictive value for intentional or fraudulent claims. Management changes from MCA filings are associated with operational disruption periods where safety standards may slip. When these signals are combined with traditional classification parameters, the combined model outperforms the classification-only model on the primary prediction targets in sandbox validation data.
The IRDAI Regulatory Sandbox cohort of 2024-25 included two SME underwriting AI applications that published summary outcome data. Application S-2024-18, from an insurer using an LLM-based risk scoring model on SME fire policies, reported that the LLM-augmented model raised the Gini coefficient of loss ratio prediction (a measure of discriminatory power, where higher is better) from 0.18 to 0.31 on its test dataset, meaning the model could distinguish high-risk from low-risk SMEs significantly better than the traditional classification approach. Application S-2024-23, from a different insurer using a similar approach on SME commercial package policies, reported a 22% reduction in loss ratio variance on the LLM-scored portfolio relative to the unscored portfolio, suggesting that better selection of risks (and loading the premium for risks whose score exceeds the threshold) was producing better underwriting outcomes.
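For readers unfamiliar with the Gini coefficient as a measure of discriminatory power, the sketch below shows one common Lorenz-curve formulation. The sandbox summaries do not state exactly how the applicants computed their figures, so this should be read as an illustration of the concept rather than a reproduction of their method; the loss and score values are invented.

```python
def gini(actual_losses: list[float], predicted_scores: list[float]) -> float:
    """Gini coefficient of a risk ranking, via the Lorenz curve.

    Policies are sorted from lowest to highest predicted risk. If the
    ranking is informative, losses concentrate at the high-risk end,
    the Lorenz curve sags below the diagonal, and the Gini is positive.
    Ties in score keep their input order.
    """
    n = len(actual_losses)
    total = sum(actual_losses)
    if n == 0 or total == 0:
        raise ValueError("need at least one policy with a non-zero loss")
    order = sorted(range(n), key=lambda i: predicted_scores[i])
    cumulative = 0.0
    area = 0.0
    for i in order:
        previous = cumulative
        cumulative += actual_losses[i] / total
        area += (previous + cumulative) / 2 / n  # trapezoid under the curve
    return 1 - 2 * area


# Four hypothetical policies; only one has a loss.
losses = [0.0, 0.0, 0.0, 10.0]
good_ranking = gini(losses, [1, 2, 3, 4])  # loss policy ranked riskiest
bad_ranking = gini(losses, [4, 3, 2, 1])   # loss policy ranked safest
no_signal = gini([5.0, 5.0, 5.0, 5.0], [1, 2, 3, 4])
```

A ranking that places the loss-making policy last scores 0.75 here, the reversed ranking scores -0.75, and a portfolio with identical losses scores 0. On this measure a move from 0.18 to 0.31 is an improvement in discrimination.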
These results must be interpreted carefully. Both sandbox applications operated on proprietary data from a single insurer's portfolio, with limited ability to generalise to other insurers' risk mixes. The test periods were relatively short (12 to 18 months), too short to observe the full tail of losses on the scored cohort. Both applications also used consent-based data access for GST and MCA information, so the scored population was biased toward firms willing to share data, and such firms differ systematically from the average SME. The results are encouraging, but not yet definitive evidence that LLM-based SME scoring works at scale across the Indian MSME sector.
MSME-Sector Bias Risks
The bias risks in LLM-based SME underwriting are more complex than the bias risks in other insurance AI applications. The primary concern in consumer insurance AI (motor, health) is demographic bias: models trained on historical data may encode the systematically worse outcomes of protected groups and perpetuate discrimination. In SME underwriting, the relevant biases operate at the level of business characteristics that are correlated with geographic, sectoral, and community patterns.
Geographic bias is the most acute risk. SMEs in Tier 2 and Tier 3 cities and in rural areas have a systematically smaller digital footprint than businesses in the metros. Their MCA filings may be less complete, their GST records more likely to have gaps, and their news mentions essentially nonexistent. An LLM that uses completeness of digital footprint as a risk signal will systematically score well-documented metro SMEs as lower risk than less-documented small-town and rural ones, even if the underlying operational risk is similar. This is not a failure of the LLM per se but a failure of the data infrastructure that the LLM relies on. The practical consequence is that LLM-based scoring may increase insurance access for metro SMEs while reducing it (through higher pricing or more declines) for small-town and rural ones.
Sectoral bias operates through the training data. If the LLM's training data (or the historical loss data used to calibrate its risk scores) overrepresents certain sectors and underrepresents others, the model's risk assessments will be more reliable for well-represented sectors. Textile and garment manufacturing, food processing, and retail are well-represented in Indian loss databases because they have been insured at scale for decades. Emerging sectors like electric vehicle component manufacturing, drone assembly, or green hydrogen equipment maintenance have thin historical data, and an LLM scoring an SME in these sectors will be extrapolating further from its training distribution.
Community bias is the most sensitive risk. Certain geographic clusters of SMEs in India are associated with specific community ownership patterns: the diamond industry in Surat, the textile industry in Bhiwandi, the gem-cutting industry in Jaipur. If historical loss data reflects worse outcomes in these clusters for reasons related to business practices, infrastructure quality, or claims management patterns, the LLM will learn these cluster-level signals. Applying them to individual firms within the cluster constitutes discrimination on the basis of community membership rather than individual risk characteristics. IRDAI's emerging fairness guidelines, informed by the sandbox consultation, specifically identify geographic cluster signals as requiring careful audit before deployment in scoring models.
Mitigation approaches include bias testing against protected and quasi-protected attributes (geographic cluster, sectoral cluster, director demographics where available) before production deployment, including fairness metrics alongside accuracy metrics in model validation, and designing the LLM's prompting and extraction pipeline to require that each risk signal be grounded in a specific factual cite rather than a generalisation about the business category. Requiring specific evidence also has an explainability benefit: the underwriter can assess whether the cited evidence is relevant and proportionate to the weight given to it.
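A minimal sketch of one such bias test follows, comparing acceptance rates across geographic groups. The group labels, the sample decisions, and the choice of a min-to-max rate ratio as the fairness metric are assumptions for illustration; the source does not specify which metrics IRDAI or the sandbox applicants use.

```python
from collections import defaultdict


def approval_rates(decisions: list[tuple[str, bool]]) -> dict[str, float]:
    """Share of proposals accepted at standard terms, per group."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for group, approved in decisions:
        counts[group][0] += int(approved)  # accepted
        counts[group][1] += 1              # total
    return {group: ok / total for group, (ok, total) in counts.items()}


def disparity_ratio(rates: dict[str, float]) -> float:
    """Lowest group rate divided by highest; 1.0 means equal treatment."""
    highest = max(rates.values())
    if highest == 0:
        raise ValueError("no group has any approvals")
    return min(rates.values()) / highest


# Hypothetical validation sample: 10 metro and 10 Tier 3 proposals.
decisions = (
    [("metro", True)] * 8 + [("metro", False)] * 2
    + [("tier3", True)] * 6 + [("tier3", False)] * 4
)
rates = approval_rates(decisions)
ratio = disparity_ratio(rates)
```

A ratio well below 1.0 does not by itself prove discrimination, since groups may differ in genuine risk, but it identifies where the cited evidence behind individual decisions deserves a closer audit.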
Sarvada Intelligence and Riskcovry Implementations
Two Indian insurtechs have been publicly associated with LLM-based SME risk assessment, offering different models for how the technology reaches the underwriting market.
Sarvada Intelligence operates as a commercial underwriting intelligence platform focused on the Indian market. Its SME underwriting product, which entered IRDAI sandbox trials in late 2024, uses a multi-source LLM pipeline that aggregates data from GST filing history (via Account Aggregator-enabled consent flows), MCA21 public filings, e-Court records, CIBIL Commercial bureau data, and web news sources to produce an SME risk narrative for commercial insurance underwriting. The platform's output is not a score but a structured risk assessment report: a narrative summary of the business's trajectory and risk factors, a list of identified signals with their source citations, a list of open questions for the underwriter to follow up with the proposer, and a preliminary premium band recommendation.
The Sarvada Intelligence approach reflects a design philosophy that LLM outputs in underwriting should augment human judgment rather than replace it. The risk narrative is explicitly designed to be reviewed by an underwriter, who can accept, modify, or reject the preliminary assessment. This positions the LLM as an information synthesis tool that reduces the underwriter's research burden, rather than as an autonomous pricing engine. The IRDAI sandbox trial covered 400 SME fire and package policies from three participating insurers. Early results, as described in the sandbox summary report published in February 2026, showed that underwriters using the Sarvada Intelligence reports spent 45% less time on initial risk assessment and identified 30% more material risk factors (those that led to premium adjustments or additional conditions) than underwriters working from proposal forms alone.
Riskcovry focuses on distribution workflows rather than underwriting intelligence, but has incorporated LLM-based risk assessment as a component of its SME distribution platform. Riskcovry's platform assists insurance brokers in serving SME clients through digital workflows; the LLM component analyses the business profile assembled during the quotation process and generates a risk summary that helps the broker recommend appropriate covers and identify potential gaps. The LLM in Riskcovry's context is oriented toward brokers and SME clients rather than underwriters, translating risk factors into plain-language explanations of why certain covers are recommended.
The two implementations illustrate the different roles LLMs can play in the SME insurance value chain: Sarvada Intelligence targets the underwriting decision, while Riskcovry targets the distribution and advisory stage. Both approaches address the data problem, but from different angles, and the two can be complementary: a broker using Riskcovry's platform to structure the SME risk profile, followed by the insurer's underwriter using Sarvada Intelligence's assessment, covers both the distribution-facing and the underwriting-facing information gaps.
IRDAI Sandbox Results for AI-Based SME Underwriting
IRDAI's Regulatory Sandbox framework has become the primary route for testing AI-based SME underwriting approaches in India. The Sandbox, established under the IRDAI (Regulatory Sandbox) Regulations 2019 and updated in 2024, allows insurers and insurtechs to apply AI approaches that are not explicitly permitted under existing regulations in a controlled environment, with structured outcome reporting to IRDAI and limited regulatory relaxation during the trial period.
The 2024-25 sandbox cohort included five applications specifically related to AI-based underwriting for SME commercial lines, more than any previous cohort. The applications covered approaches ranging from rule-based risk classification augmented with external data, to full LLM-based risk narrative generation, to hybrid approaches that used ML scoring with LLM-generated explanation summaries.
The aggregate findings, summarised in IRDAI's Sandbox Outcome Report published in March 2026, identified four consistent themes across the five applications:
First, data consent and access was the primary implementation challenge, not the AI model itself. Obtaining SME consent for GST data access through the Account Aggregator framework took an average of 6.3 days per application, compared to the underwriters' expectation of same-day data availability. Many SMEs were unfamiliar with the AA consent process, and drop-off rates during consent were high. The applications that built a guided consent flow into the proposer-facing UI had significantly lower drop-off rates than those that sent consent requests as standalone links.
Second, LLM extraction accuracy on MCA filings was high for well-formatted filings from professional secretarial firms but degraded significantly for filings prepared by the company itself without professional assistance. OCR errors in scanned MCA documents, non-standard table formats in financial statements, and missing fields in older filings reduced extraction accuracy. The applications that used a human quality-check layer on extracted financial ratios before feeding them to the scoring model outperformed those that used raw LLM extraction.
Third, portfolio loss data for validation was inadequate in every application. The 12 to 18 month sandbox observation period is too short to validate loss predictions for commercial lines, where the loss-generating events (fires, machinery breakdowns) may occur once every 5 to 10 years at the individual firm level. All five applications relied on backtesting against historical loss data as the primary validation approach, which is subject to survivorship bias and data quality limitations in the insurers' historical records.
Fourth, the premium benefit to SMEs from improved risk selection was observed in two of the five applications: SMEs with stronger risk profiles (positive revenue trends, no adverse court records, good compliance history) received premiums 15 to 20% below the standard rating, while high-risk-signal SMEs received loadings of 25 to 35%. In the other three applications, the premium differentiation was more modest or was not passed to SMEs because the insurers used the improved risk selection for portfolio management rather than individual pricing.
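To make the differentiation concrete, here is a worked example of the reported discount and loading ranges applied to a standard premium. The figure of INR 100,000 is hypothetical and chosen only for easy arithmetic.

```python
def adjusted_premium(standard: float, adjustment_pct: float) -> float:
    """Apply a percentage adjustment; negative is a discount, positive a loading."""
    return standard * (100 + adjustment_pct) / 100


STANDARD_PREMIUM = 100_000  # INR, hypothetical

# Strong risk profile: 15 to 20% below the standard rating.
strong_profile = (
    adjusted_premium(STANDARD_PREMIUM, -20),
    adjusted_premium(STANDARD_PREMIUM, -15),
)

# High-risk-signal profile: loadings of 25 to 35%.
high_risk_profile = (
    adjusted_premium(STANDARD_PREMIUM, 25),
    adjusted_premium(STANDARD_PREMIUM, 35),
)
```

On these assumptions the strong-profile SME pays between INR 80,000 and INR 85,000 and the high-risk-signal SME between INR 125,000 and INR 135,000, so two firms in the same classification cell can differ in premium by more than 50%.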
IRDAI's post-sandbox guidance, expected in Q3 2026, is likely to set specific requirements for data consent processes, model validation periods, and explanation documentation for AI-based SME underwriting. The guidance is also expected to address the bias audit requirement, reflecting the community and geographic bias risks identified during the sandbox review.
Building an LLM SME Risk Assessment Workflow That Works
Translating the technology potential of LLM-based SME risk assessment into a production underwriting workflow requires careful design across data, model, and process dimensions. The following patterns reflect the practices that worked best in the IRDAI sandbox cohort and in the early production deployments described above.
Data access architecture must be built on explicit consent. Under the DPDP Act 2023 and IRDAI's data protection requirements, personal data, and the business data that is bound up with it for proprietorships and partnerships, should not be used for underwriting without a valid consent basis. For GST data, the RBI-regulated Account Aggregator framework (for which Sahamati is the industry alliance) is the appropriate consent mechanism; for credit bureau data, bureau-specific consent APIs are available. Building the consent flow into the quotation journey, where the SME owner completes a short digital consent process that connects their GST portal and AA data to the insurer's system, is more effective than requesting consent as a separate step after the quotation has begun.
Prompting strategy for the LLM must require source citation for every extracted fact. Prompts that instruct the model to 'summarise the business's financial position' without requiring specific citations from the documents produce fluent summaries that may contain hallucinated details. Prompts that instruct the model to 'extract the reported revenue for each period in the GSTR-3B filings provided, with the document name and date for each figure, and compute the year-on-year trend' produce outputs that can be verified against the source documents. Source-cited extraction is more laborious to prompt and more expensive in inference tokens, but the output is substantially more reliable for underwriting use.
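The citation requirement is only useful if it is enforced after the model responds. The sketch below pairs an example prompt with a check that rejects any extracted figure lacking a known source document and filing date. The prompt wording, the JSON keys, and the document names are illustrative assumptions; no particular LLM API is called, and the sample response is hand-written.

```python
import json

EXTRACTION_PROMPT = (
    "Extract the reported revenue for each period in the GSTR-3B filings "
    "provided. Return a JSON list of objects with the keys: period, "
    "revenue_inr, document, filing_date. Use only figures that appear in "
    "the documents. If a figure is missing, omit the period."
)


def citation_problems(model_output: str, known_documents: set[str]) -> list[str]:
    """List every extracted item that cannot be traced to a supplied document."""
    problems = []
    for index, item in enumerate(json.loads(model_output)):
        if item.get("document") not in known_documents:
            problems.append(f"item {index}: unknown or missing document")
        if not item.get("filing_date"):
            problems.append(f"item {index}: missing filing date")
    return problems


# Hand-written stand-in for a model response; the second item is uncited.
sample_output = json.dumps([
    {"period": "2024-04", "revenue_inr": 1250000,
     "document": "gstr3b_2024_04.pdf", "filing_date": "2024-05-20"},
    {"period": "2024-05", "revenue_inr": 1310000, "document": "summary"},
])
problems = citation_problems(sample_output, {"gstr3b_2024_04.pdf"})
```

Checking that a citation exists is weaker than checking that the cited document actually contains the figure, so this validation complements, rather than replaces, a comparison against the source text.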
Human oversight checkpoints should be placed at the data extraction stage, not only at the final assessment stage. If the LLM extracts incorrect financial figures from an MCA filing, and those figures flow into the risk score without a human check, the downstream risk narrative and premium recommendation will be based on incorrect inputs. A lightweight human quality check on extracted numeric data, particularly balance sheet figures and revenue numbers, catches extraction errors before they propagate. This check adds 10 to 15 minutes of human time per application but prevents the larger downstream cost of a mis-priced risk.
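One way to keep the human check lightweight is to run automated sanity checks first, so that the reviewer sees the most likely extraction errors at the top of the queue. The sketch below uses the accounting identity that total assets equal liabilities plus equity; the field names and the 1% tolerance are assumptions, and this prioritisation step is a suggestion rather than a practice documented in the sandbox reports.

```python
def review_reasons(extract: dict[str, float], tolerance: float = 0.01) -> list[str]:
    """Reasons an extracted balance sheet should be prioritised for human review."""
    required = ("total_assets", "total_liabilities", "shareholders_equity", "revenue")
    missing = [name for name in required if name not in extract]
    if missing:
        return ["missing fields: " + ", ".join(missing)]

    reasons = []
    assets = extract["total_assets"]
    funding = extract["total_liabilities"] + extract["shareholders_equity"]
    # Accounting identity: assets = liabilities + equity. A mismatch often
    # indicates an OCR digit error or a misread table column.
    if assets <= 0 or abs(assets - funding) > tolerance * assets:
        reasons.append("balance sheet does not balance")
    if extract["revenue"] < 0:
        reasons.append("negative revenue")
    return reasons


# Hypothetical extractions, in INR crore.
clean = review_reasons({"total_assets": 10.0, "total_liabilities": 6.0,
                        "shareholders_equity": 4.0, "revenue": 15.0})
suspect = review_reasons({"total_assets": 10.0, "total_liabilities": 6.0,
                          "shareholders_equity": 40.0, "revenue": 15.0})
incomplete = review_reasons({"total_assets": 10.0})
```

An extraction that passes these checks can still be wrong, so the checks order the reviewer's queue and do not remove the need for the human look at numeric fields.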
Output format should preserve the evidence trail. The LLM's output should be a structured report that distinguishes between facts (with source citations), inferences from facts (clearly labelled as inferences), and open questions (items where available data is ambiguous or incomplete). The underwriter should be able to see, for each risk signal, where it came from and how material the LLM judges it to be. This structure enables the underwriter to review and challenge the LLM's analysis, and it produces the audit trail required by IRDAI's Information and Cyber Security Guidelines 2023 for AI-assisted underwriting decisions.
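The three-way split between facts, inferences, and open questions can be sketched as a small data structure. The class and field names are hypothetical, and the example content is invented; the point is that every inference must name the facts it rests on, which can be checked mechanically.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fact:
    fact_id: str
    statement: str
    source_document: str
    source_date: str


@dataclass(frozen=True)
class Inference:
    statement: str
    based_on: tuple[str, ...]  # fact_ids that support this inference
    materiality: str           # "low", "medium" or "high", as judged by the LLM


@dataclass
class RiskReport:
    facts: list[Fact] = field(default_factory=list)
    inferences: list[Inference] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)

    def unsupported_inferences(self) -> list[Inference]:
        """Inferences that cite no facts, or cite facts not in the report."""
        known = {fact.fact_id for fact in self.facts}
        return [
            inf for inf in self.inferences
            if not inf.based_on or not set(inf.based_on) <= known
        ]


report = RiskReport(
    facts=[Fact("F1", "GST turnover fell 20% year on year",
                "gstr3b_2024_2025.pdf", "2025-04-20")],
    inferences=[
        Inference("Possible financial stress", ("F1",), "medium"),
        Inference("Maintenance budgets may be under pressure", ("F9",), "low"),
    ],
    open_questions=["Has the firm lost a major customer?"],
)
unsupported = report.unsupported_inferences()
```

Rejecting or flagging a report whose unsupported list is non-empty gives the underwriter a guarantee that each labelled inference can be traced to at least one cited fact.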
Feedback loops from underwriter decisions back to the LLM system are essential for model improvement. When an underwriter disagrees with the LLM's assessment and records the reason for the override, this feedback should be captured and used in the next model evaluation cycle. Over time, recurring override patterns reveal where the LLM is consistently wrong (specific industries, specific financial patterns, specific court record types) and guide prompt and model improvements.
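A minimal sketch of the aggregation step, assuming each underwriter review is logged as a record with a sector and an override flag (the record layout and the sample data are hypothetical):

```python
from collections import Counter


def override_rates(reviews: list[dict]) -> dict[str, float]:
    """Share of LLM assessments overridden by the underwriter, per sector."""
    totals: Counter = Counter()
    overrides: Counter = Counter()
    for review in reviews:
        totals[review["sector"]] += 1
        if review["overridden"]:
            overrides[review["sector"]] += 1
    return {sector: overrides[sector] / totals[sector] for sector in totals}


# Hypothetical review log.
reviews = [
    {"sector": "textiles", "overridden": False},
    {"sector": "textiles", "overridden": False},
    {"sector": "textiles", "overridden": True},
    {"sector": "textiles", "overridden": False},
    {"sector": "ev_components", "overridden": True},
    {"sector": "ev_components", "overridden": True},
]
rates_by_sector = override_rates(reviews)
```

With samples this small the rates are only indicative, but the same grouping applied to a full evaluation cycle, and repeated by override reason or document type, shows where prompt or model changes should be targeted first.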