The increasing ability to store and analyze data, driven by advances in technology, has given actuaries opportunities to optimize the capital held by insurance companies. Optimizing capital often lowers a company's cost of capital, which can translate into higher profit from the lower cost incurred, or into greater competitiveness through lower premiums charged for insurance plans.
In this analysis, the tidymodels packages are used to demonstrate how modern data science R packages can assist actuaries in predicting the ultimate claim cost once claims are reported. Because these packages conform to tidy data concepts, they flatten the learning curve for applying different machine learning techniques to complement conventional actuarial analysis, effectively allowing actuaries to build various machine learning models in a tidier and more efficient manner.
The packages also enable users to harness the power of data science to mine the “gold” in unstructured data, such as claim descriptions and item descriptions. Ultimately, this would enable companies to hold less reserve through more accurate claim estimation without compromising their solvency, allowing the capital to be redeployed for other purposes.
Keywords: tidymodels, Machine Learning, Actuarial Science, Insurance, Claim Cost Estimation
The modularized structure of tidymodels also allows users to break the different model components into modules and reuse them throughout the analysis, enabling a tidier way to create and maintain a machine learning analysis.
Also, as the functions used in this analysis follow tidy data concepts, users can pass the output from one function to another without much transformation, shortening the time required to prepare the analysis.
Figure 1: List of tidymodels Packages
This research also demonstrates how various modern data science packages in R can provide actuaries with another set of toolkits to complement conventional actuarial analysis. The entire analysis (i.e. from data import to communicating the results) is implemented in RMarkdown, giving readers a glimpse of how the analysis can be automated.
Once the data is imported into the environment, the rsample package from tidymodels is used to split the dataset into training and testing sets. The formula, the dataset, and the required data pre-processing steps are then defined in a recipe object.
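As a minimal sketch of this splitting step (the dataset name `df_claims`, the 80/20 split proportion, and the normalization step are illustrative assumptions, not the paper's exact choices):

```r
library(tidymodels)  # loads rsample, recipes, and friends

set.seed(123)
claims_split <- initial_split(df_claims, prop = 0.8)  # 80/20 train/test split
df_train <- training(claims_split)
df_test  <- testing(claims_split)

# The formula, dataset, and pre-processing steps live in a recipe object
claims_recipe <- recipe(UltimateIncurredClaimCost ~ ., data = df_train) %>%
  step_normalize(all_numeric_predictors())
```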
Modern data science packages, such as textrecipes, allow users to extract text information from claim descriptions as features. This lets actuaries tap into their unstructured data to include more features in the machine learning models, potentially enhancing model performance.
Following is the code chunk showing how one could extract text features from the claim description field:
ranger_recipe_clmdesc <- recipe(formula = init_ult_diff ~ ., data = df_wClmDesc_train) %>%
  step_tokenize(ClaimDescription) %>%
  step_stopwords(ClaimDescription) %>%
  step_tokenfilter(ClaimDescription, max_tokens = 20) %>%
  step_tfidf(ClaimDescription)
Figure 2: Code Chunk to Extract Text Features
Conventionally, different R packages would be used to build different machine learning models, and the model interfaces of these packages can be quite different.
Figure 3: Predict Function from Different Machine Learning Packages
Instead of using these different R packages directly, the parsnip package from tidymodels provides users with a unified model interface. The package functions as a wrapper that standardizes the model interface, effectively flattening the learning curve for actuaries to explore different machine learning algorithms to sharpen the claim cost estimation.
Figure 4: Example of Different Model Interface
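To illustrate the unified interface, two different algorithms can be specified with the same parsnip verbs (the engine choices and parameter values here are illustrative assumptions):

```r
library(parsnip)

# Random forest via the ranger engine
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# Regularized regression via the glmnet engine
glmnet_spec <- linear_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet")

# Both are then used with the same verbs:
# fit(spec, formula, data) and predict(fitted, new_data)
```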
Model Fitting, Tuning and Evaluation
While defining the model specifications, model parameters can be tagged with the tune() function to indicate that these parameters are to be tuned during the cross-validation step, as shown under Figure 4.
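A sketch of such tagging might look as follows; the choice of mtry and min_n as the tuned parameters is an assumption for illustration:

```r
library(parsnip)
library(tune)

# mtry and min_n are left as tune() placeholders, to be
# filled in by the cross-validation search later
ranger_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("regression")
```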
Then, all the steps defined earlier are chained together as a workflow using the workflows package. This modularized approach to modeling allows users to reuse previously created objects instead of recreating them from scratch, which helps ensure consistency between the objects and makes the code easier to maintain.
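For illustration, a workflow such as glmnet_workflow could be assembled from previously created objects like this; the model specification glmnet_spec and the initial formula are assumptions inferred from the surrounding code:

```r
library(workflows)

# Chain the formula and the model specification into one workflow;
# glmnet_spec is assumed to be a parsnip specification created earlier
glmnet_workflow <- workflow() %>%
  add_formula(init_ult_diff ~ .) %>%
  add_model(glmnet_spec)
```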
glmnet_workflow_ult <- glmnet_workflow %>%
  update_formula(UltimateIncurredClaimCost ~ .)
Figure 5: Example of Updating the Formula in the Created Workflow
Cross-validation is then performed to search for the best set of parameters. The consistent model performance metric interface allows users to loop through the models and calculate the necessary performance metrics.
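A minimal sketch of this search, assuming a workflow object ranger_workflow holding the tuned specification and five-fold resampling (both assumptions for illustration):

```r
library(rsample)
library(tune)

set.seed(123)
claims_folds <- vfold_cv(df_train, v = 5)  # 5-fold cross-validation

# Evaluate 20 candidate parameter combinations across the folds
ranger_tune <- tune_grid(
  ranger_workflow,
  resamples = claims_folds,
  grid = 20
)

# Pick the combination with the lowest RMSE
best_params <- select_best(ranger_tune, metric = "rmse")
```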
model_metrics <- metric_set(rmse, rsq, mase)
Figure 6: Code Chunk to Define the Performance Metrics
ranger_metric <- ranger_fit %>%
  collect_predictions() %>%
  model_metrics(truth = init_ult_diff, estimate = .pred)
Figure 7: Code Chunk to Compute Model Performance
As the tidymodels packages also follow tidy data concepts, users can pass their output to other packages that use tidy data concepts without much data transformation. This also enables users to leverage the strengths of other packages to perform the analysis more effectively.
Following is an example of how the output from a workflow can be passed into the ggplot2 package to visualize the results:
ranger_vip_clmdesc <- pull_workflow_fit(ranger_fit_clmdesc$.workflow[[1]]) %>%
  vi()

ranger_vip_graph_clmdesc <- ranger_vip_clmdesc %>%
  slice_max(abs(Importance), n = 10) %>%
  ungroup() %>%
  mutate(Importance = abs(Importance),
         Variable = fct_reorder(Variable, Importance)) %>%
  ggplot(aes(Importance, Variable)) +
  geom_col(show.legend = FALSE) +
  labs(y = NULL, title = "Random Forest Model with TidyText")
Figure 8: Example on How Output from Tidymodels Can Pass to Other Tidy Packages Without Much Data Transformation
Below is the ggplot graph of the variable importance:
Figure 9: Variable Importance Graph
Availability of supporting source code and requirements
- Project name: Navigating Insurance Claim Data through Tidymodels Universe
- Project home page: *to open a folder in Github
- Operating system(s): Windows
- Programming language: R
- Other requirements: R 4.0.5 or higher
- License: Nil
The dataset can be found under the Actuarial Loss Prediction Kaggle competition.
List of abbreviations
Your comments and questions are valued and encouraged. Contact the author at:
Name: Professor KAM Tin Seong
We are grateful to the following resources for inspiring this research project.
- Tidy Modeling with R
- Machine Learning Methods to Perform Pricing Optimization: A Comparison with Standard Generalized Linear Models
- TidyModels by Max Kuhn (24 Feb 2021) - Cleveland R User Group
- GLM Vs. Machine Learning - with Case Studies in Pricing by Zhou, John, and Debbie Deng. 2019