Navigating Insurance Claim Data through Tidymodels Universe


The increasing ability to store and analyze data, driven by advances in technology, has given actuaries opportunities to optimize the capital held by insurance companies. Optimizing capital often lowers a company's cost of capital, which can translate into an increase in profit from the lower cost incurred, or an increase in competitiveness through lowering the premiums the company charges for its insurance plans.

In this analysis, tidymodels packages are used to demonstrate how modern data science R packages can assist actuaries in predicting the ultimate claim cost once claims are reported. These packages' conformity with tidy data concepts has flattened the learning curve for applying different machine learning techniques to complement conventional actuarial analysis, effectively allowing actuaries to build various machine learning models in a tidier and more efficient manner.

The packages also enable users to harness the power of data science to mine the “gold” in unstructured data, such as claim descriptions and item descriptions. Together, these capabilities would enable companies to hold less reserve through more accurate claim estimation without compromising their solvency, allowing the capital to be redeployed for other purposes.


tidymodels, Machine Learning, Actuarial Science, Insurance, Claim Cost Estimation


The modularized structure of tidymodels also allows users to break the analysis into separate model components and reuse them throughout the analysis. This enables a tidier way to create and maintain machine learning analyses.

Also, as the functions used in this analysis follow tidy data concepts, users can pass the output from one function to another without much transformation. This shortens the time required to prepare the analysis.

`tidymodels` Packages

Figure 1: List of tidymodels Packages

This research has also demonstrated how various modern data science packages in R can provide actuaries with another set of toolkits to complement conventional actuarial analysis. The analysis also demonstrates how to implement the entire workflow (i.e. from data importing to communicating the results) entirely in RMarkdown, giving readers a glimpse of how this analysis can be automated.

Model Building

Data Pre-processing

Once the data is imported into the environment, the rsample package from tidymodels is used to split the dataset into training and testing datasets. The formula, the dataset, and the required data pre-processing steps are defined in a recipe object.
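The split-and-recipe step described above can be sketched as follows. This is an illustrative, self-contained example: the synthetic data frame and its column names are stand-ins, not the actual competition data or the paper's exact code.

```r
library(tidymodels)

# Small synthetic stand-in for the claims data (column names are assumptions)
set.seed(123)
df_claims <- tibble(
  UltimateIncurredClaimCost = rlnorm(200, meanlog = 8),
  InitialIncurredClaimsCost = rlnorm(200, meanlog = 8),
  Gender                    = sample(c("M", "F"), 200, replace = TRUE)
)

# Hold out 25% of the claims as a testing set
claims_split <- initial_split(df_claims, prop = 0.75)
claims_train <- training(claims_split)
claims_test  <- testing(claims_split)

# The formula, the data, and the pre-processing steps live together
# in a single recipe object
claims_recipe <-
  recipe(UltimateIncurredClaimCost ~ ., data = claims_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
```

Because the recipe carries its own pre-processing steps, the same object can later be attached to different workflows without repeating the steps.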

Modern data science packages, such as the textrecipes package, allow users to extract the text information from claim descriptions as features. This allows actuaries to tap into their unstructured data to include more features in the machine learning models, potentially enhancing model performance.

The following code chunk shows how one could extract text features from the claim description field:

ranger_recipe_clmdesc <-
  recipe(formula = init_ult_diff ~ ., data = df_wClmDesc_train) %>%
  step_tokenize(ClaimDescription) %>%
  step_stopwords(ClaimDescription) %>%
  step_tokenfilter(ClaimDescription, max_tokens = 20) %>%
  # turn the retained tokens into numeric term-frequency features
  # (a typical final step in a textrecipes pipeline)
  step_tf(ClaimDescription)

Figure 2: Code Chunk to Extract Text Features

Model Selection

Conventionally, different R packages are used to build different machine learning models, and the model interfaces of these packages can differ considerably.

Different machine learning functions

Figure 3: Predict Function from Different Machine Learning Packages [1]

Instead of using different R packages directly, the parsnip package from tidymodels provides users with a unified model interface. The package functions as a wrapper that standardizes the model interface. This has effectively flattened the learning curve for actuaries to explore different machine learning algorithms to sharpen their claim cost estimation.
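As an illustration of the unified interface, the two hypothetical specifications below use the same parsnip syntax even though they fit very different models (a ranger random forest and a glmnet penalized regression); only the engine and arguments change:

```r
library(parsnip)

# A random forest specification using the ranger engine
rf_spec <-
  rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# A lasso regression specification using the glmnet engine,
# built with the same grammar
glmnet_spec <-
  linear_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet")
```

Swapping algorithms therefore means changing one specification, not relearning a new package's fitting and predicting conventions.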

Unified Model Interface

Figure 4: Example of Different Model Interface

Model Fitting, Tuning and Evaluation

While defining the model specifications, model parameters can be tagged with the tune() function to indicate that they are to be tuned during the cross-validation step, as shown in Figure 4.
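A minimal sketch of tagging parameters with tune() in a model specification; the specification name and the choice of tuned parameters here are assumptions for illustration, not the paper's exact code:

```r
library(tidymodels)

# mtry and min_n are tagged with tune(): they are placeholders to be
# filled in with candidate values during cross-validation
ranger_spec <-
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("regression")
```

The tagged parameters can be inspected with tune_args(), which returns one row per parameter awaiting a value.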

Then, all the steps defined earlier are chained together into a workflow using the workflows package. This modularized approach to modeling allows users to reuse previously created objects instead of recreating them from scratch, which helps ensure consistency between the objects created and makes the code easier to maintain.

glmnet_workflow_ult <-
  glmnet_workflow %>%
  update_formula(UltimateIncurredClaimCost ~ .)

Figure 5: Example of Updating the Formula in the Created Workflow

Cross-validation is then performed to search for the best set of parameters. Consistent model performance metrics allow users to loop through the models and calculate the necessary performance metrics.
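The cross-validation search can be sketched as follows. This is a self-contained toy example using the built-in mtcars data as a stand-in for the claims workflow; the fold count, grid size, and metric names are illustrative assumptions:

```r
library(tidymodels)

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)  # 5-fold cross-validation resamples

# A small workflow: lasso regression with the penalty left to be tuned
wf <-
  workflow() %>%
  add_model(
    linear_reg(penalty = tune(), mixture = 1) %>% set_engine("glmnet")
  ) %>%
  add_formula(mpg ~ .)

cv_metrics <- metric_set(rmse, rsq)

# Evaluate 10 candidate penalty values across the folds
tune_res <- tune_grid(wf, resamples = folds, grid = 10, metrics = cv_metrics)

# Pick the penalty with the lowest cross-validated RMSE
# and lock it into the workflow
best_params <- select_best(tune_res, metric = "rmse")
final_wf    <- finalize_workflow(wf, best_params)
```

The finalized workflow can then be fitted on the full training set and evaluated on the held-out testing set.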

model_metrics <- metric_set(rmse, rsq, mase)

Figure 6: Code Chunk to Define the Performance Metrics

ranger_metric <- ranger_fit %>%
  collect_predictions() %>%
  model_metrics(truth = init_ult_diff, 
                estimate = .pred)

Figure 7: Code Chunk to Compute Model Performance

Model Explainability

As the tidymodels packages also follow tidy data concepts, users can pass the output to other packages that use tidy data concepts without much data transformation. This also enables users to leverage the strengths of other packages to perform the analysis more effectively.

The following is an example of how one could pass the output from a workflow into the ggplot2 package to visualize the results:

ranger_vip_clmdesc <- pull_workflow_fit(ranger_fit_clmdesc$.workflow[[1]]) %>%
  vip::vi()  # extract the variable importance scores (Variable, Importance)

ranger_vip_graph_clmdesc <- ranger_vip_clmdesc %>%
  slice_max(abs(Importance), n = 10) %>%
  ungroup() %>%
  mutate(Importance = abs(Importance),
         Variable = fct_reorder(Variable, Importance)) %>%
  ggplot(aes(Importance, Variable)) +
  geom_col(show.legend = FALSE) +
  labs(y = NULL, title = "Random Forest Model with TidyText")

Figure 8: Example on How Output from Tidymodels Can Pass to Other Tidy Packages Without Much Data Transformation

Below is the ggplot graph of the variable importance:

Figure 9: Variable Importance Graph

Availability of supporting source code and requirements


  • Project name: Navigating Insurance Claim Data through Tidymodels Universe
  • Project home page: *to open a folder in Github
  • Operating system(s): Windows
  • Programming language: R
  • Other requirements: R 4.0.5 or higher
  • License: Nil

Data availability

The dataset can be found under Actuarial Loss Prediction Kaggle competition.


List of abbreviations

Not applicable

Competing interests

Not applicable


Not applicable

Author contributions

Your comments and questions are valued and encouraged. Contact the author at:

Name: Jasper LOK Jun Haur

Name: Professor KAM Tin Seong


We are grateful to the following resources, which inspired this research project.

  1. Tidy Modeling with R
  2. Machine Learning Methods to Perform Pricing Optimization: A Comparison with Standard Generalized Linear Models
  3. TidyModels by Max Kuhn (24 Feb 2021) - Cleveland R User Group
  4. GLM vs. Machine Learning - with Case Studies in Pricing by Zhou, John, and Debbie Deng. 2019

  1. Kuhn M, Silge J. Tidy Modeling with R. 2021.