The BNP Paribas Claims Management competition ran on Kaggle from February to April 2016. Just under 3000 teams made up of over 3000 Kagglers competed to predict insurance claims categories based on data collected during the claim filing process. The anonymized dataset challenged competitors to dig deeply into data understanding and feature engineering, and the keen approach taken by Team Dexter's Lab claimed first place.
The basics
What was your background prior to entering this challenge?
Darius: BSc and MSc in Econometrics at Vilnius University (Lithuania). I currently work as an analyst at a local credit bureau, Creditinfo. My work mainly involves analyzing and building predictive models with business and consumer credit data for financial sector companies.
Davut: BSc and MSc in Computer Engineering and Electronics Engineering (Double Major), and currently a PhD student in Computer Engineering at Istanbul Technical University (Turkey). I work as a back-end service software developer at a company that provides stock exchange data to users. My work is not related to any data science subjects, but I work on Kaggle whenever I get spare time. I live in Istanbul, where traffic is a headache - I spend almost 4-5 hours a day in traffic, and during my commute I code for Kaggle.
Song: Two masters (Geological Engineering and Applied Statistics). Currently I am working in an insurance company. My work is mainly building models - pricing models for insurance products, fraud detection, etc.
Do you have any prior experience or domain knowledge that helped you succeed in this competition?
Davut: Song has lots of experience in the insurance field and Darius in the finance field. At first, we were stuck with mediocre results for 2-3 weeks until Darius came up with a great idea, which we then had nice discussions about. Both perspectives helped us improve our score and led us to victory.
How did you get started competing on Kaggle?
Darius: I heard of Kaggle a few years ago, but only recently started Kaggling. I was looking for challenges and the chance to work with different types of data. Surprisingly, my data insights and feature engineering were good enough to claim prize money in my very first serious competition. Kaggle has become my favorite hobby since.
Davut: Three years ago, I took a course during my Master's degree in which a professor gave us a term project from Kaggle (Adzuna Job Salary Prediction). I did not participate then, but six months later I started Kaggling. The Higgs Boson Machine Learning Challenge was my first serious competition, and since then I've participated in more competitions and met great data scientists and friends.
Song: I have been studying machine learning by myself. Kaggle is an excellent site to learn by doing and learn from each other.
What made you decide to enter this competition?
Darius: I like working with anonymous data, and I thought that I had an edge over the competition as I had discovered interesting insights in previous competitions as well. Davut had wanted to team up in a previous competition, so we joined forces early in this one.
Davut: Kaggle became kind of an addiction to me, as it has for many others. After Prudential, I wanted to participate in one more.
Song: Nothing special.
Let's get technical
What preprocessing and supervised learning methods did you use?
Darius: The most important part was setting a stratified 10-fold CV scheme early on. For most of the competition, a single XGBoost was my benchmark model (in the end, the single model would have scored 4th place). In the last 2 weeks, I made a few diverse models such as rgf, lasso, elastic net, and SVM.
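For readers curious what that setup looks like in practice, here is a minimal sketch of a stratified 10-fold CV loop around a single XGBoost benchmark, assuming a binary target scored by log loss; the parameters shown are illustrative defaults, not the team's actual settings.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss

def cv_benchmark(X, y, n_splits=10, seed=42):
    """Stratified k-fold CV score for a single XGBoost benchmark model."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof = np.zeros(len(y))  # out-of-fold predictions
    params = {               # illustrative parameters, not the winning settings
        "objective": "binary:logistic",
        "eval_metric": "logloss",
        "eta": 0.02,
        "max_depth": 7,
        "subsample": 0.9,
        "colsample_bytree": 0.7,
    }
    for train_idx, valid_idx in skf.split(X, y):
        dtrain = xgb.DMatrix(X[train_idx], label=y[train_idx])
        dvalid = xgb.DMatrix(X[valid_idx], label=y[valid_idx])
        booster = xgb.train(params, dtrain, num_boost_round=2000,
                            evals=[(dvalid, "valid")],
                            early_stopping_rounds=50, verbose_eval=False)
        oof[valid_idx] = booster.predict(dvalid)
    return log_loss(y, oof)
```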
Davut: We got different feature sets, and trained various diverse models on those such as knn, extra tree classifiers, random forest, and neural networks. We also tried different objectives in XGBoost.
Song: In our final solution, we used XGBoost as the ensembling (stacking) model. It combined 20 XGBoost models, 5 random forests, 6 randomized decision tree models, 3 regularized greedy forests, 3 logistic regression models, 5 ANN models, 3 elastic net models, and 1 SVM model.
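What Song describes is classic stacking: out-of-fold predictions from the base models become the input features for the second-level XGBoost. Below is a hedged sketch of that generic pattern with scikit-learn-style base models, not the team's exact pipeline.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

def make_meta_features(base_models, X, y, X_test, n_splits=10, seed=0):
    """Out-of-fold predictions from each base model become one meta-feature
    column; a second-level model (XGBoost in the team's case) is then
    trained on train_meta and applied to test_meta."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    train_meta = np.zeros((len(y), len(base_models)))
    test_meta = np.zeros((X_test.shape[0], len(base_models)))
    for j, model in enumerate(base_models):
        fold_test_preds = np.zeros((X_test.shape[0], n_splits))
        for k, (tr, va) in enumerate(skf.split(X, y)):
            m = clone(model).fit(X[tr], y[tr])
            train_meta[va, j] = m.predict_proba(X[va])[:, 1]       # out-of-fold
            fold_test_preds[:, k] = m.predict_proba(X_test)[:, 1]  # refit per fold
        test_meta[:, j] = fold_test_preds.mean(axis=1)  # average over folds
    return train_meta, test_meta
```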
What was your most important insight into the data?
Darius: The most important insight was understanding what kind of data we were given. It is hard to make assumptions about anonymous data, but I dedicated 3 weeks of competition time to data exploration, which paid dividends.
First, as every feature in the given dataset was scaled and had some random noise introduced, I figured that identifying how to deal with noise and un-scaling the data could be important. I thought of a simple but fast method to detect the scaling factor for integer type features. It took some time, but in the end it was a crucial part of our winning solution.
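As an illustration of one simple way such a search could work (the interview does not spell out the team's exact method), you can try candidate scale factors and keep the one that turns a column back into near-integers despite the added noise:

```python
import numpy as np

def detect_scale(values, candidates, tol=1e-2):
    """Return the first candidate scale s for which values / s are almost
    whole numbers, i.e. the column looks like integers that were multiplied
    by s and perturbed with small noise.

    Generic illustration only; the team's actual detection method is not
    described in detail in the interview.
    """
    v = np.asarray(values, dtype=float)
    v = v[~np.isnan(v)]
    for s in candidates:
        rescaled = v / s
        if np.mean(np.abs(rescaled - np.round(rescaled))) < tol:
            return s
    return None

# Hypothetical usage: try a grid of plausible scale factors for one column.
# scale = detect_scale(train["v10"], candidates=np.arange(1, 5000) / 10000.0)
```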
Second, given our assumptions about variable meanings, we built efficient feature interactions. We devised, among other ideas, lag and lead features based on our impression that we were dealing with panel data (see the sketch after this paragraph). In the end, our assumptions about panel data and variable meanings were not realistic (indeed, they would imply that the same client could face hundreds or thousands of claims). However, our lag and lead features did bring significant value to our solution, which is certainly because they were an efficient way to encode interactions. This is consistent with the other top two teams' solutions, which also benefited from encoding interactions between v22 and other variables, using methods other than lag and lead. In our opinion, there is certainly very interesting business insight for the host in these features.
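A rough sketch of what a lag/lead feature looks like in pandas under the team's (admittedly unrealistic) panel-data framing: rows are grouped by a categorical key such as v22, and each row looks at its neighbours' values. The key, ordering column, and feature column below are hypothetical placeholders.

```python
import pandas as pd

def add_lag_lead(df, key="v22", order_col="ID", cols=("v50",)):
    """Within each group sharing the same key (here v22), add the previous
    and next row's value of each chosen column as new features.

    key, order_col and cols are hypothetical placeholders; the interview
    only says such features were a strong way to encode v22 interactions.
    """
    out = df.sort_values([key, order_col]).copy()
    grouped = out.groupby(key)
    for c in cols:
        out[f"{c}_lag"] = grouped[c].shift(1)    # previous row in the group
        out[f"{c}_lead"] = grouped[c].shift(-1)  # next row in the group
    return out.sort_index()  # restore original row order
```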
Were you surprised by any of your findings?
Darius: To my surprise, our approach was not overfitting at all. Other than that, I believed in our assumptions (be they correct or not) and we figured that other teams were just doing approximations of our findings - which other top teams admitted.
How did you spend your time on this competition?
Some Kagglers start training models right away and keep doing that until the end, focusing only on ensembling. But we focused on how to improve a single model with new features. We spent 45 days on feature engineering, then the rest of the time on model training and stacking.
What was the run time for both training and prediction of your winning solution?
Our single best model takes less than an hour to train on an 8-12 core machine. However, the ensemble itself takes several days to finish.
Words of wisdom
What have you taken away from this competition?
Darius: I tried XGBoost parameters I had not used before, which proved helpful in this competition. I also created my own R wrapper for rgf, and I got noticed by top Kagglers, which I did not think would happen so soon.
Davut: The team play was amazing, and we had so much fun during the competition. We tried so many crazy ideas, most of which failed, but it was still really fun.
Song: Keep learning endlessly. Competing as part of a team is really a happy journey.
Do you have any advice for those just getting started in data science?
Darius: Be prepared to work hard as good results don't come easy. Make a list of what you want to get good at first and prioritize. Don't let XGBoost be the only tool in your toolbox.
Davut: Spend sufficient time on feature engineering, and study previous competitions' solutions, no matter how old they are. For example, our winning approach is quite similar to Josef Feigl's winning solution in Loan Default Prediction.
Song: Keep learning.
Just for fun
If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?
Darius: As an econometrician, I love competitions which involve predicting future trends. I'd love to put ARIMA and other time series methods into action more often.
Davut: In the Higgs Boson Challenge, high-energy physicists and data scientists competed together. I liked the spirit then and remember Lubos Motl and his posts brought new aspects to the approaches. I would like to pose a multidisciplinary problem.
Song: Any problem balancing exploration and exploitation.
What is your dream job?
Darius: Making a global impact on people's lives with data science projects.
Davut: Using data science for early diagnosis for severe diseases like cancer, heart attack, etc.
Song: Data Scientist.
Acknowledgments
We want to thank this Kaggle blog post, which helped us greatly by shaking some of our prior beliefs about the data and helping us brainstorm new ideas.