Risk & Tabular ML

Automobile Insurance Claims Risk Modeling

Predicting motor-insurance claim frequency on 678K French policies — with a decision tree, neural network, and PCA implemented from scratch and benchmarked against a Negative Binomial GLM.

A machine-learning study on the French motor third-party liability dataset (~678,000 policies) that predicts how many claims a policy will file in a year. A decision tree, a feed-forward neural network, and PCA were implemented from scratch in NumPy, validated against scikit-learn and PyTorch references, and benchmarked against a Negative Binomial GLM — the actuarially natural model for over-dispersed count data.

Overview

The business goal is to estimate the expected number of claims per policy per year so an insurer can price risk fairly — charging more for high-risk drivers and less for low-risk ones. The target is `ClaimNb` normalized by `Exposure` (claims per year).

The project is deliberately comparative: models hand-built from scratch sit next to library references and a classical statistical model, scored on the same split with the same metrics.

Claim data is heavily zero-inflated — the vast majority of policies never file a claim — so the work treats metric choice and distributional assumptions as central rather than incidental.

Problem

The dataset covers ~678,000 policies with 11 driver and vehicle features, including driver age, bonus-malus, vehicle power/age/brand/fuel, region, population density, and exposure.

Claims are rare and over-dispersed, so plain accuracy is meaningless and squared error barely separates models.

Raw counts have to be normalized by exposure to model a comparable per-year claim rate.

It is not obvious up front whether modern ML actually beats classical actuarial regression on this kind of low-signal data.
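The exposure normalization described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's code: the function name `claim_rate` and the toy policies are hypothetical, and only the `ClaimNb` / `Exposure` ratio comes from the text.

```python
import numpy as np

def claim_rate(claim_nb, exposure):
    """Annualized claim frequency: claims divided by years of exposure."""
    claim_nb = np.asarray(claim_nb, dtype=float)
    exposure = np.asarray(exposure, dtype=float)
    if np.any(exposure <= 0):
        raise ValueError("exposure must be strictly positive")
    return claim_nb / exposure

# Hypothetical toy policies (not taken from the dataset).
claim_nb = np.array([0, 0, 1, 0, 2])
exposure = np.array([1.0, 0.5, 0.25, 0.75, 1.0])  # years on risk

rates = claim_rate(claim_nb, exposure)            # per-policy claims per year
zero_share = np.mean(claim_nb == 0)               # share of policies with no claim
portfolio_rate = claim_nb.sum() / exposure.sum()  # exposure-weighted frequency
```

One claim in a quarter-year of exposure annualizes to four claims per year, which is why raw counts are not comparable across policies. The logarithm of a zero rate is undefined, so a log-rate target needs some convention for zero-claim policies; this sketch stops at the raw rate.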

My Role

This was a three-person BSc exam project at the IT University of Copenhagen. The points below are my own contributions; teammates owned the from-scratch decision tree, the Random Forest, and parts of the final evaluation.

Built the feed-forward neural network from scratch in NumPy — including the custom loss functions, optimizers, batch iterator, and training loop.

Built the PyTorch reference MLP used to sanity-check the from-scratch network.

Implemented PCA from scratch and ran the dimensionality and clustering analysis of the policy data.

Led the data cleaning, preprocessing, and exploratory analysis pipeline, and contributed to the Negative Binomial GLM modeling.
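For readers unfamiliar with what "PCA from scratch" involves, here is a minimal NumPy sketch. It assumes the SVD route on mean-centered data; the write-up does not say whether the project used SVD or an eigendecomposition of the covariance matrix, and the names `pca_fit` and `pca_transform` and the synthetic data are hypothetical.

```python
import numpy as np

def pca_fit(X, n_components):
    """Fit PCA via SVD of the centered data (assumed approach)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal axes, ordered by singular value.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return mean, Vt[:n_components], ratio[:n_components]

def pca_transform(X, mean, components):
    """Project data onto the fitted principal axes."""
    return (np.asarray(X, dtype=float) - mean) @ components.T

# Synthetic data with one dominant direction, [1, 2, 0], plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
X = latent @ np.array([[1.0, 2.0, 0.0]]) + 0.1 * rng.normal(size=(500, 3))

mean, components, ratio = pca_fit(X, n_components=2)
Z = pca_transform(X, mean, components)  # shape (500, 2)
```

On real policy data the features would normally be standardized first, since PCA is sensitive to scale; that step is omitted here.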

Approach

Settled on the log claim rate, log(`ClaimNb` / `Exposure`), as the regression target to handle exposure and heavy skew.

Modeled the rare-event structure explicitly: a Negative Binomial GLM for over-dispersed counts, with exposure carried as a GLM offset.

Validated the from-scratch models by matching their test metrics against the scikit-learn and PyTorch equivalents.

Also framed a binary claim / no-claim task evaluated with ROC AUC, given how strongly zero-claim policies dominate the data.
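The role of the exposure offset can be written out explicitly. The notation here is an assumption for illustration: y_i is the claim count, e_i the exposure in years, x_i the feature vector, β the coefficients, and α the dispersion parameter, with the common NB2 variance form assumed.

```latex
\begin{aligned}
y_i &\sim \mathrm{NB}(\mu_i, \alpha), \qquad
\operatorname{Var}(y_i) = \mu_i + \alpha \mu_i^2, \\
\log \mu_i &= \log e_i + x_i^\top \beta
\quad\Longleftrightarrow\quad
\frac{\mu_i}{e_i} = \exp\!\left(x_i^\top \beta\right).
\end{aligned}
```

Because log e_i enters with its coefficient fixed at 1, expected counts scale proportionally with exposure and exp(x_i^T β) is the annual claim rate. The α μ_i^2 term is what absorbs over-dispersion; as α approaches 0 the model reduces to a Poisson GLM.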

Results

The Negative Binomial GLM gave the best test RMSE (3.03) and an interpretable Poisson deviance of 1.88, matching or narrowly edging out both the trees and the neural network.

The from-scratch decision tree matched its scikit-learn reference almost exactly (RMSE 3.04 vs 3.03), confirming the manual implementation was correct.

On the binary claim / no-claim task, a Random Forest reached ROC AUC 0.65.

Headline finding: on noisy, zero-inflated claim data the simple statistical model matched or beat far more complex ones — added complexity did not buy accuracy.
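For reference, the two regression metrics quoted above can be computed as follows. This is a sketch under assumptions: it uses the mean (per-observation) Poisson deviance, the same form as scikit-learn's `mean_poisson_deviance`, whereas the write-up does not state whether 1.88 is a mean, a total, or an exposure-weighted figure. The function names are hypothetical.

```python
import numpy as np

def mean_poisson_deviance(y_true, y_pred):
    """Mean Poisson deviance: 2 * mean(y*log(y/mu) - y + mu)."""
    y = np.asarray(y_true, dtype=float)
    mu = np.asarray(y_pred, dtype=float)
    if np.any(y < 0):
        raise ValueError("observed counts must be non-negative")
    if np.any(mu <= 0):
        raise ValueError("predictions must be strictly positive")
    # By convention y*log(y/mu) is 0 when y == 0, so use a ratio of 1 there.
    ratio = np.where(y > 0, y / mu, 1.0)
    return float(np.mean(2.0 * (y * np.log(ratio) - y + mu)))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y = np.asarray(y_true, dtype=float)
    mu = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y - mu) ** 2)))

# Hypothetical toy values: only the zero-claim policy contributes deviance.
dev = mean_poisson_deviance([0, 1, 2], [0.5, 1.0, 2.0])
err = rmse([0, 1, 2], [0.5, 1.0, 2.0])
```

Unlike squared error, the deviance penalizes a prediction relative to its own scale, which is part of why it stays informative when most observations are zero.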

Policies Analyzed

678,013

French motor third-party liability dataset.

Models Compared

From-scratch DT & MLP, sklearn DT, PyTorch MLP, Random Forest, NB-GLM.

Best Test RMSE

3.03

Negative Binomial GLM, on log claim-rate.

Classification AUC

0.65

Random Forest, claim vs no-claim (ROC AUC 0.6531).

Visuals

Outputs and diagrams from the project.

Figure: Claim-count distribution (log scale), showing the heavy zero-inflation that shapes every modeling choice here.

Figure: Predicted vs actual claim rate (parity plot) for the best model, the Negative Binomial GLM.

Charts & Figures

Saved figures and chart artifacts referenced by the project.

Figure: ROC curve for the Random Forest on the binary claim / no-claim task (AUC 0.6531).
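ROC AUC has a simple interpretation that helps in reading a value like 0.65: it is the probability that a randomly chosen claim policy receives a higher score than a randomly chosen no-claim policy. The sketch below computes it from ranks (the Mann-Whitney U formulation) in plain NumPy. The function name and toy scores are hypothetical, and the project itself may well have used a library routine.

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) identity, with tie handling."""
    y = np.asarray(y_true).astype(bool)
    s = np.asarray(scores, dtype=float)
    n_pos = int(y.sum())
    n_neg = int(y.size - n_pos)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # Assign 1-based ranks, averaging ranks within groups of tied scores.
    order = np.argsort(s, kind="mergesort")
    _, first_idx, counts = np.unique(s[order], return_index=True, return_counts=True)
    avg_rank = first_idx + (counts + 1) / 2.0
    ranks = np.empty(s.size, dtype=float)
    ranks[order] = np.repeat(avg_rank, counts)

    # U statistic for the positive class, normalized by the number of pairs.
    u = ranks[y].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))

# Hypothetical toy example: 3 of the 4 positive/negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A classifier that scores every policy identically gets 0.5 under this definition, so 0.65 reflects a real but modest ranking signal.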

Stack

Python, NumPy, SciPy, scikit-learn, PyTorch, statsmodels, pandas, Matplotlib, seaborn

Challenges

Extracting signal from heavily zero-inflated, over-dispersed claim data where most policies never claim.

Implementing a decision tree and neural network from scratch and getting them to match library references.

Choosing metrics — exposure-normalized rates and Poisson deviance — that stay meaningful on rare-event data.

Keeping the comparison fair across very different model families on identical splits and features.
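To give a sense of what a from-scratch network involves, here is a deliberately small NumPy sketch with the same moving parts named earlier: a loss, an optimizer step, a batch iterator, and a training loop. Everything in it is an assumption for illustration (one hidden ReLU layer, MSE loss, plain mini-batch SGD, the class name `TinyMLP`, the hyperparameters, and the synthetic data); it is not the project's architecture.

```python
import numpy as np

class TinyMLP:
    """One-hidden-layer regression MLP with manual backpropagation."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=np.sqrt(1.0 / n_hidden), size=(n_hidden, 1))
        self.b2 = np.zeros(1)

    def forward(self, X):
        # Cache intermediates needed by the backward pass.
        self.X = X
        self.z1 = X @ self.W1 + self.b1
        self.a1 = np.maximum(self.z1, 0.0)  # ReLU
        return (self.a1 @ self.W2 + self.b2).ravel()

    def backward(self, y_pred, y_true):
        # Gradient of mean squared error with respect to each parameter.
        n = y_true.shape[0]
        d_out = (2.0 / n) * (y_pred - y_true).reshape(-1, 1)
        gW2 = self.a1.T @ d_out
        gb2 = d_out.sum(axis=0)
        d_hidden = (d_out @ self.W2.T) * (self.z1 > 0)
        gW1 = self.X.T @ d_hidden
        gb1 = d_hidden.sum(axis=0)
        return gW1, gb1, gW2, gb2

    def step(self, grads, lr):
        # Plain SGD update, applied in place.
        for param, grad in zip((self.W1, self.b1, self.W2, self.b2), grads):
            param -= lr * grad

def iterate_batches(X, y, batch_size, rng):
    """Yield shuffled mini-batches covering the data once."""
    idx = rng.permutation(X.shape[0])
    for start in range(0, idx.size, batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

def mse(y_pred, y_true):
    return float(np.mean((y_pred - y_true) ** 2))

# Synthetic regression problem (hypothetical, unrelated to the claims data).
rng = np.random.default_rng(0)
X = rng.normal(size=(512, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

model = TinyMLP(n_in=3, n_hidden=16)
loss_before = mse(model.forward(X), y)
for epoch in range(200):
    for Xb, yb in iterate_batches(X, y, batch_size=64, rng=rng):
        grads = model.backward(model.forward(Xb), yb)
        model.step(grads, lr=0.02)
loss_after = mse(model.forward(X), y)
```

Checking a hand-written `backward` against a framework's autograd on the same weights and batch is one practical way to do the kind of reference matching described above.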

Lessons

On noisy, low-signal tabular data, a well-specified statistical model (NB-GLM with an exposure offset) can match or beat trees and neural nets.

Building models from scratch and matching them to scikit-learn / PyTorch is the clearest way to confirm you actually understand them.

For count data, the right distributional assumption and metric matter more than model complexity.

Match model complexity to the signal in the data, not to whatever is currently in fashion.