The Santander Product Recommendation competition ran on Kaggle from October to December 2016. Over 2,000 Kagglers competed to predict which products Santander customers were most likely to purchase based on historical data. With a pure XGBoost approach and just 8GB of RAM, Ryuji Sakata (AKA Jack (Japan)) earned his second solo gold finish by coming in 3rd place. He simplified the problem by breaking it down into several binary classification models, one for each product. Read on to learn how he dealt with unusual temporal patterns in the dataset in a competition where feature engineering was key.
The basics
What was your background prior to entering this challenge?
My university degree is in Aeronautics and Astronautics, and I researched reliability engineering. There, I focused on probability theory and statistics. I have worked for Panasonic Group as a data scientist for about 4 years, but I didn't have any knowledge of machine learning until I started my current job. Almost all of my knowledge of machine learning comes from my experience in Kaggle competitions.
How did you get started competing on Kaggle?
I joined Kaggle about three years ago in order to learn machine learning through practice. Now, I enjoy Kaggle competitions whenever I have spare time.
What made you decide to enter this competition?
Before the launch of this competition, there was no running competition I could enter, mainly because of data size. I have only an 8GB laptop, which limits my participation in competitions. However, this competition allowed me to compete with other Kagglers using my own machine, and that’s why I entered.
Let's get technical
What was your most important insight into the data?
I inspected the new purchase trends of each product and found that 2 specific products, cco_fin and reca_fin, had unusual trends (Figure A). Because of these unusual trends, to predict new purchases in June 2016, I decided that cco_fin and reca_fin should be trained on data from different months than the other products. Therefore, I decided to train a model for each product separately, using different training data for each product, rather than building just one model. (I ignored the June peak of nom_pens because the February peak was not periodic.)
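As a rough illustration of this kind of inspection, the sketch below counts new purchases per month for a single product, treating a new purchase as the ownership flag changing from 0 to 1 between consecutive months. The function name `new_purchase_counts`, the data layout, and the toy records are assumptions made for this example, not the author's code (he worked in R).

```python
from collections import Counter

def new_purchase_counts(history):
    """Count 0 -> 1 transitions per month for one product.

    history: dict mapping customer id -> list of (month, flag) tuples,
    sorted by month, where flag is 1 if the customer holds the product.
    (Hypothetical layout chosen for this sketch.)
    """
    counts = Counter()
    for records in history.values():
        # Compare each month with the one before it.
        for (_, prev_flag), (month, flag) in zip(records, records[1:]):
            if prev_flag == 0 and flag == 1:
                counts[month] += 1
    return counts

# Toy data for three customers and one product.
history = {
    "a": [("2015-05", 0), ("2015-06", 1), ("2015-07", 1)],
    "b": [("2015-05", 0), ("2015-06", 1), ("2015-07", 0)],
    "c": [("2015-05", 1), ("2015-06", 0), ("2015-07", 1)],
}
counts = new_purchase_counts(history)
```

Plotting such counts by month for each product is one way to spot months whose behaviour differs from the rest.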
What preprocessing and supervised learning methods did you use?
In this competition, extracting information from past purchase history of customers was very important. I made features as listed below:
ind_(xyz)_ult1_last: the index of the product last month (lag-1)
ind_(xyz)_ult1_00: the number of index transitions from 0 to 0 up to last month
ind_(xyz)_ult1_01: the number of index transitions from 0 to 1 up to last month
ind_(xyz)_ult1_10: the number of index transitions from 1 to 0 up to last month
ind_(xyz)_ult1_11: the number of index transitions from 1 to 1 up to last month
ind_(xyz)_ult1_0len: the length of the run of consecutive 0 indices up to last month
products_last: the concatenation of last month's indices of all products
n_products_last: the number of products held last month
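The features listed above could be computed along the following lines. This is a minimal sketch, not the author's R code: the function names `history_features` and `basket_features` are hypothetical, and it assumes that the "0len" feature means the length of the run of zeros ending at last month.

```python
def history_features(flags):
    """Summarise one customer's history for one product.

    flags: list of 0/1 ownership indicators, oldest month first,
    ending at last month.
    """
    feats = {"last": flags[-1], "00": 0, "01": 0, "10": 0, "11": 0}
    # Count each kind of month-to-month transition.
    for prev, cur in zip(flags, flags[1:]):
        feats[f"{prev}{cur}"] += 1
    # Length of the run of zeros that ends at last month (assumed meaning).
    zero_run = 0
    for flag in reversed(flags):
        if flag != 0:
            break
        zero_run += 1
    feats["0len"] = zero_run
    return feats

def basket_features(last_month_flags):
    """Return (products_last, n_products_last) for one customer."""
    products_last = "".join(str(f) for f in last_month_flags)
    return products_last, sum(last_month_flags)

feats = history_features([0, 0, 1, 1, 0, 0])
basket = basket_features([1, 0, 0, 1])
```

For the history 0, 0, 1, 1, 0, 0 this gives two 0-to-0 transitions, one each of the other three kinds, and a trailing zero run of length 2.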
Some of these are shown in the figures below. The feature products_last is not numeric, so it can’t be handled by XGBoost directly. I replaced each of its values with the mean of the target variable for that value (the height of each bar in Figure C).
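Replacing a categorical value with the mean of the target is often called mean (or target) encoding. A minimal sketch of the idea follows; the function name `mean_target_encoding` and the toy data are assumptions for illustration, and the source does not say whether any smoothing or out-of-fold scheme was used.

```python
from collections import defaultdict

def mean_target_encoding(categories, targets):
    """Map each category to the mean of the binary target within it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}

# Toy products_last strings and new-purchase targets (illustrative only).
categories = ["100", "100", "010", "010", "010", "001"]
targets = [1, 0, 0, 0, 1, 1]

encoding = mean_target_encoding(categories, targets)
# Numeric column that a tree model can consume.
encoded = [encoding[cat] for cat in categories]
```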
An overview of the training and ensembling is illustrated in the figure below. The only training method I used was XGBoost, and the models for each product were trained separately as binary classification tasks. To ensemble predictions from models trained on different training data, the predictions were first normalized so that the probabilities of the 18 products summed to 1. After the normalization, the multiple predictions for each product were log-averaged (that is, combined by geometric mean). Then the probabilities of all products were merged, and the top 7 products were selected to make a submission.
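The normalize, log-average, and top-k steps can be sketched for a single customer as below. The helper names, the three-product toy predictions, and the small clipping constant are assumptions for this example; normalizing per customer within each training run is one reading of the description above, not a confirmed detail.

```python
import math

def normalize(probs):
    """Scale one run's product probabilities so they sum to 1."""
    total = sum(probs.values())
    return {product: p / total for product, p in probs.items()}

def log_average(predictions, eps=1e-15):
    """Average log-probabilities across runs (a geometric mean)."""
    n = len(predictions)
    return {
        product: math.exp(
            sum(math.log(max(pred[product], eps)) for pred in predictions) / n
        )
        for product in predictions[0]
    }

def top_k(probs, k=7):
    """Return the k products with the highest combined probability."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

# Predictions for one customer from two runs trained on different months.
run_a = {"cco_fin": 0.2, "reca_fin": 0.1, "nom_pens": 0.1}
run_b = {"cco_fin": 0.1, "reca_fin": 0.3, "nom_pens": 0.1}

combined = log_average([normalize(run_a), normalize(run_b)])
ranked = top_k(combined, k=2)
```

Averaging in log space penalizes a product that any single run considers unlikely more strongly than an arithmetic mean would.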
Which tools did you use?
I used the R language, including the packages data.table, dplyr and xgboost. I would like to master Python too in future.
What was the run time for both training and prediction of your winning solution?
The number of training runs was 128 (18 products * 7 runs + 2 products * 1 run).
Each training run took about 10 minutes, so the total estimated training time is about 1,280 minutes, or roughly 21 hours. Each prediction run took 1 minute or less, so the total prediction time is about 2 hours.
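The arithmetic behind these estimates can be checked in a few lines. The variable names are chosen for this sketch, and the per-run times are the approximate figures quoted above.

```python
# 18 products trained 7 times each, plus 2 products trained once each.
n_runs = 18 * 7 + 2 * 1

# About 10 minutes per training run.
train_minutes = n_runs * 10
train_hours = train_minutes / 60

# At most about 1 minute per prediction run.
predict_minutes_max = n_runs * 1
predict_hours_max = predict_minutes_max / 60
```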
Words of wisdom
What have you taken away from this competition?
I realized the importance of feature engineering through this competition. I think that one of the turning points of the game was how much information we could extract from the data, rather than training methods or parameter tuning. I believe it is worthwhile to spend a lot of time on it.
Do you have any advice for those just getting started in data science?
Let’s Kaggle together!
Bio
Ryuji Sakata works for Panasonic Group as a data scientist. He has been involved in data science for about 4 years. He holds a master's degree in Aeronautics and Astronautics from Kyoto University.