Structured Data Modeling in the Insurance Space

By Josh Lewis, Head of Product, Newfront Insurance

Ask an account manager to cross out the parts of ACORD 125 they don't find relevant and you'll get a lot of red ink. Unfortunately, the ACORD forms, along with IVANS downloads, are the closest thing we have to an industry-standard data representation in the insurance space. Machine learning methods thrive in the presence of well-structured, high-quality data, and in retail insurance such data are in short supply.

At Newfront we are focused on how we can best represent the coverage offered by our carriers and operational information regarding our clients. We use this data to build a better experience for our clients when they transact and manage insurance.

Our concerns as a brokerage are quite different from those on the underwriting side, where data models are more mature and innovation comes from acquiring and using novel sources of data to make better underwriting decisions. Technical challenges in our domain include automatically generating a comprehensive summary of insurance for a client, or determining the appropriate marketing strategy for a client based on their risk characteristics and historical carrier behavior.

We've broken down the world of insurance data into two large buckets: the coverage provided by carriers, and the business operations of our clients. In both cases we've built technology to help gather and structure this information.

Start with coverage data. The source of truth lies in the quote and policy documents generated by carriers. Typically, account managers must read these documents and key the coverage and pricing information they contain into an agency management system. We've built a tool that automatically parses PDF documents to extract structured coverage and pricing data, obviating the manual interpretation and data entry steps of quote and policy ingestion.

We debated two approaches to building this form of automation: a rule-based system, and one based on natural language processing and machine learning methods. Our goal is to minimize the time and effort our account managers spend reviewing the extracted data, which means maximizing our ability to estimate the accuracy of each extraction. Rule-based systems perform well on samples that match their target representation (e.g., Hartford workers compensation quotes) and typically quite poorly on samples outside that target (a notable exception is when carriers share underlying quote templates). Machine learning systems do a better job generalizing outside their training corpus, but provide no guarantees on the accuracy of any given extraction.

To put it another way, across an entire document corpus a rule-based approach may have only 40 percent total accuracy, but near 100 percent accuracy on the documents from which it does extract a significant amount of data. In contrast, a machine learning approach could have 60 percent accuracy across the corpus, but you'd have no guarantee of the accuracy of extraction from a given document. The rule-based accuracy profile is preferable for us because we can separate our quotes and policies into two tracks: those that require little to no human review, and those that require full human review. Because the machine learning approach lacks a strong per-document accuracy guarantee, we'd always need a significant human review component.
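The two-track idea can be sketched in a few lines of Python. This is only an illustration, not a description of Newfront's system: the `Extraction` class, the `route` function, and the 0.9 coverage threshold are all hypothetical names and values chosen for the example. It assumes that the share of expected fields a rule set manages to fill is a usable proxy for "extracted a significant amount of data."

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical cutoff: share of expected fields the rules must fill
# before an extraction is sent down the light-review track.
COVERAGE_THRESHOLD = 0.9


@dataclass
class Extraction:
    document_id: str
    expected_fields: List[str]
    extracted: Dict[str, str]  # field name -> raw text pulled from the PDF

    def coverage(self) -> float:
        """Fraction of expected fields that came back non-empty."""
        if not self.expected_fields:
            return 0.0
        filled = sum(1 for name in self.expected_fields if self.extracted.get(name))
        return filled / len(self.expected_fields)


def route(extraction: Extraction, threshold: float = COVERAGE_THRESHOLD) -> str:
    """Pick a review track based on how much the rules managed to extract."""
    if extraction.coverage() >= threshold:
        return "light_review"
    return "full_review"
```

A document whose rule set fills every expected field lands in the light-review track; one where the rules barely fire falls back to full human review.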

We enhance our data extraction process with validation. We know the data types of the fields we're trying to parse, and can flag a type mismatch (alphabetic characters where we expect a dollar amount, or a dollar amount where we expect a date). As for the rules themselves, our rules engine enables a variety of extraction methods, including regular expressions, spatial instructions (extract all text in this bounding box on page 2), and inline custom functions.
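As a rough sketch of how regular-expression rules and type validation might fit together, consider the Python below. The field names, label patterns, date format, and the `extract_and_validate` helper are assumptions made for illustration; the article does not describe the actual rule syntax, and the spatial and custom-function extraction methods are not shown here.

```python
import re
from datetime import datetime
from typing import Dict

# Accepts forms such as "$12,345.00", "12,345", or "1200.50".
DOLLAR_RE = re.compile(r"^\$?\d{1,3}(,\d{3})*(\.\d{2})?$|^\$?\d+(\.\d{2})?$")


def is_dollar_amount(text: str) -> bool:
    return bool(DOLLAR_RE.match(text.strip()))


def is_date(text: str) -> bool:
    # Assumes US-style MM/DD/YYYY dates; real documents vary.
    try:
        datetime.strptime(text.strip(), "%m/%d/%Y")
        return True
    except ValueError:
        return False


VALIDATORS = {"dollar": is_dollar_amount, "date": is_date}

# Hypothetical rule set: field -> (regex with one capture group, expected type).
RULES = {
    "total_premium": (re.compile(r"Total Premium:\s*(\S+)"), "dollar"),
    "effective_date": (re.compile(r"Effective Date:\s*(\S+)"), "date"),
}


def extract_and_validate(text: str) -> Dict[str, dict]:
    """Apply each rule, then flag values that fail their type check."""
    results = {}
    for field, (pattern, expected_type) in RULES.items():
        match = pattern.search(text)
        value = match.group(1) if match else None
        valid = value is not None and VALIDATORS[expected_type](value)
        results[field] = {"value": value, "valid": valid}
    return results
```

A value flagged as invalid, such as a date captured where a premium was expected, is a natural signal to send the document for human review.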

The second bucket of data we have worked to develop a representation for is operational information regarding our clients. Here the challenge is its breadth and diversity. As a restaurant you may be asked about the details of your UL300 suppression system maintenance and what percentage of your sales comes from alcohol, whereas as a construction company you'll be asked about personal protective equipment and the percentage of work subcontracted.

In the process of digitizing hundreds of carrier applications and supplementals, we've built a database with tens of thousands of questions that carriers ask of our clients. This creates a deduplication problem: we don't want to separately represent the questions "What year did you incorporate your business?" and "Year of business incorporation."

To assist our application digitization team in avoiding such duplication, we've used NLP methods to detect and highlight potential dupes. We do common preprocessing (stemming, tokenization, and removing stop words) followed by TF-IDF to highlight existing questions similar to a given question wording. The digitization team then has the option to reuse the existing question rather than creating a new one specific to the application that they are working on.
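The pipeline described above (tokenize, drop stop words, stem, then score with TF-IDF) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the stop-word list is tiny, `crude_stem` is a toy suffix stripper standing in for a real stemmer, cosine similarity is assumed as the comparison measure, and the 0.5 threshold and the `similar_questions` helper are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "did", "do", "does", "for", "in",
              "is", "of", "the", "to", "what", "you", "your"}


def crude_stem(token: str) -> str:
    """Toy suffix stripper (an assumption, not a real stemming algorithm)."""
    for suffix in ("ing", "ion", "ed", "es", "e", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> List[str]:
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors using a smoothed inverse document frequency."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * \
        math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def similar_questions(new_question: str, existing: List[str],
                      threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return existing questions scoring at or above the threshold, best first."""
    docs = [preprocess(q) for q in list(existing) + [new_question]]
    vectors = tfidf_vectors(docs)
    query = vectors[-1]
    scored = [(q, cosine(query, v)) for q, v in zip(existing, vectors[:-1])]
    matches = [pair for pair in scored if pair[1] >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)
```

With this toy stemmer, "Year of business incorporation." and "What year did you incorporate your business?" reduce to the same three stems, so the existing question would be surfaced to the digitization team as a likely duplicate.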

One of the things that most excited me when I joined Newfront was the massive data representation challenge that insurance presents. We need ontologies that accurately reflect the operations of every business in every industry, and all the possible coverage provided by carriers. In building these ontologies we will unlock our ability to train sophisticated machine learning models on top of our newly structured data.
