The Red Hat Predicting Business Value competition ran on Kaggle from August to September 2016. Well over two thousand players competed on 2271 teams to accurately identify potential customers with the most business value based on their characteristics and activities. In this interview, Darius Barušauskas (AKA raddar) explains how he pursued and achieved his first solo gold medal with his 1st place finish. Now an accomplished Competitions Grandmaster after one year of competing on Kaggle, Darius shares his winning XGBoost solution plus his words of wisdom for aspiring data scientists.
The basics
What was your background prior to entering this challenge?
I have been on Kaggle for a year now and it has been a very exciting time of my life. ☺ In my years working in data analytics I have acquired many useful data mining and ML skills, which have flourished in the Kaggle competitions I’ve participated in.
Do you have any prior experience or domain knowledge that helped you succeed in this competition?
The problem itself was not new to me - in my work I have built several models for detecting the potential of new clients. They were designed differently from Red Hat’s problem, but that experience helped me make useful feature transformations in this competition.
What made you decide to enter this competition?
I aimed for a solo gold medal to achieve my Grandmaster’s title - it took me only a year! I am very happy that I decided to dedicate all my spare time to this and that I was able to make my goals come true: I got my top 10 overall rank, a nice win and a hefty reward. ☺
This competition was a tight race. How did you approach it differently from past competitions?
I have always preferred working in a team. As this was a dedicated solo run, there were times when it was hard to concentrate and easy to procrastinate - I had to look for moral support from my Kaggle friends. Thank you, guys!
Let’s get technical
What was your most important insight into the data?
The presence of leakage transformed the original problem into two sub-problems, which I tackled simultaneously:
a) Interpolating outcome values for companies with some leakage information
b) Predicting outcome values for companies not affected by leakage
I chose to turn the leakage into several features so that my ML models could directly predict the points in time at which the value changed - in contrast to many others, who were using ad hoc rules.
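As a rough illustration of turning leaked information into features rather than hard rules, the sketch below assumes the leak takes the form of outcomes already known for a company on certain dates. The interview does not describe the actual features, so the function name, the input layout and every feature name here are hypothetical.

```python
from bisect import bisect_left, bisect_right
from datetime import date

def leak_features(known, query_date):
    """Build features from leaked (date, outcome) pairs of one company.

    `known` is a list of (date, outcome) pairs sorted by date; this layout
    is an assumption made for illustration, not the competition's schema.
    """
    dates = [d for d, _ in known]
    i = bisect_right(dates, query_date) - 1   # last known point on or before the query
    j = bisect_left(dates, query_date)        # first known point on or after the query
    prev_out = known[i][1] if i >= 0 else None
    next_out = known[j][1] if j < len(known) else None
    return {
        "prev_outcome": prev_out,
        "next_outcome": next_out,
        "days_since_prev": (query_date - dates[i]).days if i >= 0 else None,
        "days_to_next": (dates[j] - query_date).days if j < len(known) else None,
        # True when the query sits between two known points with different
        # outcomes, i.e. the value changed somewhere inside that gap.
        "change_between": prev_out is not None
                          and next_out is not None
                          and prev_out != next_out,
    }

known = [(date(2016, 1, 1), 0), (date(2016, 1, 10), 1)]
features = leak_features(known, date(2016, 1, 4))
```

Features of this kind leave it to the model to decide where inside the gap the outcome flips, instead of fixing that with a hand-written rule.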
The data itself presented several ways to tackle the problem, given Red Hat’s client company-user-activity relation. I chose a top-down modelling approach: create robust company-level models first, then incorporate them into activity-level models using company users’ information.
The main principle of my company-level models was to take the first observation in time as a reference point for each company, then aggregate the activities having the same outcome value and build ML models on that subset of data (with similar model versions taking the last observation as the reference point as well). Having robust predictions for the first and last observations translated well into capturing if and when a company’s value changed over time. These models were critical for my solution to work, so I dedicated 90% of my time to them.
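One possible reading of the reference-point idea is sketched below: per company, take the earliest activity and aggregate the leading run of activities that share its outcome. The tuple layout, the function name and the summary fields are hypothetical, and whether the original models used only a contiguous run is an assumption.

```python
from itertools import groupby
from operator import itemgetter

def first_segment(activities):
    """Return, per company, a summary of the leading run of activities
    that share the outcome of the company's first observation.

    `activities` is an iterable of (company_id, day, outcome) tuples; the
    tuple layout is hypothetical.
    """
    result = {}
    ordered = sorted(activities, key=itemgetter(0, 1))
    for company, rows in groupby(ordered, key=itemgetter(0)):
        rows = list(rows)
        reference_outcome = rows[0][2]        # first observation in time
        run = []
        for row in rows:
            if row[2] != reference_outcome:   # value changed: stop aggregating
                break
            run.append(row)
        result[company] = {
            "outcome": reference_outcome,
            "n_activities": len(run),
            "first_day": run[0][1],
            "last_day": run[-1][1],
        }
    return result

activities = [
    ("A", 1, 0), ("A", 2, 0), ("A", 5, 1),
    ("B", 3, 1), ("B", 4, 1),
]
segments = first_segment(activities)
```

The mirrored version that uses the last observation as the reference point could be obtained by sorting each company's activities in descending time order before taking the run.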
What preprocessing and supervised learning methods did you use?
My solution had a simple 4+2 model structure: 4 company-level XGBoost models incorporated into 2 activity-level XGBoost models. The first activity-level model was CV-optimized (and had very poor public LB performance), while the other was selected for giving the best public LB score; a combination of these two strategies provided a huge score uplift in my final submission.
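The interview does not say how the two activity-level models were combined. As a minimal sketch, assuming a plain weighted average of their predicted probabilities (the function name and the default weight are hypothetical):

```python
def blend(cv_preds, lb_preds, weight_cv=0.5):
    """Weighted average of two equally long lists of predictions.

    `weight_cv` is the share given to the CV-optimized model; 0.5 is an
    arbitrary illustrative default, not a value from the interview.
    """
    if len(cv_preds) != len(lb_preds):
        raise ValueError("prediction lists must have the same length")
    if not 0.0 <= weight_cv <= 1.0:
        raise ValueError("weight_cv must lie in [0, 1]")
    return [weight_cv * a + (1.0 - weight_cv) * b
            for a, b in zip(cv_preds, lb_preds)]

blended = blend([0.2, 0.8], [0.4, 0.6])
```

In practice the weight would itself be chosen on held-out data rather than fixed in advance.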
Other methods did not work as well as XGBoost. Because of the leak, I did not want my solution to become complicated, so I just stuck with XGBoost. Microsoft’s brand new LightGBM would have produced even better results, so had the competition been held a month or two later, I would probably have preferred LightGBM.
What was the run time for both training and prediction of your winning solution?
Due to the simplicity of the solution, it takes only a few hours on a 12-thread CPU and is RAM-friendly.
How did you use Kernels in this competition?
I produced my very first popular Kaggle kernel! I had not used sparse matrices before (surprise?) - seeing how easily they can be created and manipulated in R, I wanted to share that with everyone.
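To show why sparse matrices suit one-hot encoded categorical data, here is a small standard-library Python sketch of the compressed sparse row (CSR) layout. It is not the author's kernel, which was written in R; the function name and the toy rows are hypothetical.

```python
def one_hot_csr(rows):
    """One-hot encode rows of categorical values into CSR-style arrays.

    Returns (data, indices, indptr, vocabulary):
      data    - the non-zero values (all 1 for one-hot encoding)
      indices - the output column of each non-zero value
      indptr  - row i occupies data[indptr[i]:indptr[i + 1]]
    """
    vocabulary = {}   # (input position, value) -> output column index
    data, indices, indptr = [], [], [0]
    for row in rows:
        for position, value in enumerate(row):
            # Assign the next free column the first time a value is seen.
            column = vocabulary.setdefault((position, value), len(vocabulary))
            indices.append(column)
            data.append(1)
        indptr.append(len(indices))
    return data, indices, indptr, vocabulary

rows = [("type 1", "red"), ("type 2", "red"), ("type 1", "blue")]
data, indices, indptr, vocabulary = one_hot_csr(rows)
```

Only the non-zero cells are stored, so memory grows with the number of rows times the number of categorical fields rather than with the number of distinct category values.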
Words of wisdom:
What have you taken away from this competition?
- Leave no stone unturned when it comes to testing silly ideas.
- Combining a cross-validation approach with a public LB overfitting approach can yield surprisingly good results. I did not expect that.
- Competing solo at high ranks is very tough.
Do you have any advice for those just getting started in data science?
- Try running simple Kaggle kernels written by others and try to understand what is going on. Asking questions and receiving answers is the fastest way to learn how things are done.
- Try to acquire technical skills first - try as many methods as you can, create your own code templates for running and making predictions on any given dataset.
- Learn how to do proper cross-validation and understand why it is important.
- Don’t let XGBoost be the only tool in your toolbox.
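On the cross-validation advice above: what counts as "proper" depends on the data. For data structured like this competition's, where many activities belong to one company, one common precaution is to keep each group entirely on one side of every split. The sketch below is a standard-library illustration of that idea, not the validation scheme used in the winning solution; the function name and toy groups are hypothetical.

```python
def group_k_fold(groups, n_splits=3):
    """Yield (train_idx, valid_idx) pairs in which no group spans both sides.

    `groups[i]` is the group label (for example a company id) of row i.
    Groups are assigned to folds round-robin in sorted order.
    """
    unique = sorted(set(groups))
    if n_splits < 2 or n_splits > len(unique):
        raise ValueError("n_splits must be between 2 and the number of groups")
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for fold in range(n_splits):
        valid = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, valid

groups = ["A", "A", "B", "C", "C", "D"]
folds = list(group_k_fold(groups, n_splits=2))
```

A validation score computed this way estimates performance on unseen groups, which a purely random row split would tend to overstate.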
Just for fun
If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?
Given a list of personal names and login names, predict any risk-related event using only external public internet data, especially social networks.
What is your dream job?
Developing data science models to improve the quality of everyone’s daily life.
Bio
Darius Barušauskas holds a BSc and an MSc in Econometrics (Vilnius University, Lithuania). He specializes in credit and other risk modelling (5+ years of experience) and has created many different models for the financial, telco and utilities sectors. He is an R and SQL guru.