Santander Product Recommendation Competition: 3rd Place Winner's Interview, Ryuji Sakata

The Santander Product Recommendation competition ran on Kaggle from October to December 2016. Over 2,000 Kagglers competed to predict which products Santander customers were most likely to purchase based on historical data. With his pure XGBoost approach and just 8GB of RAM, Ryuji Sakata (AKA Jack (Japan)) earned his second solo gold finish by coming in 3rd place. He simplified the problem by breaking it down into several binary classification models, one for each product. Read on to learn how he dealt with unusual temporal patterns in the dataset in a competition where feature engineering was key.

The basics

What was your background prior to entering this challenge?

My university degree is in Aeronautics and Astronautics, where I researched reliability engineering and studied probability theory and statistics in particular. I have worked for Panasonic Group as a data scientist for about 4 years, but I had no knowledge of machine learning before starting this job. Almost all of my machine learning knowledge comes from my experience in Kaggle competitions.

Ryuji Sakata (Jack (Japan)) on Kaggle.

How did you get started competing on Kaggle?

I joined Kaggle about three years ago in order to learn machine learning through practice. Now I enjoy Kaggle competitions whenever I have spare time.

What made you decide to enter this competition?

Before this competition launched, there was no running competition I could enter, mainly because of data size. I have only an 8GB laptop, which limits the competitions I can take part in. This competition, however, allowed me to compete with other Kagglers using my own machine, and that's why I entered.

Let's get technical

What was your most important insight into the data?

I inspected the new-purchase trends of each product and found that two specific products, cco_fin and reca_fin, had unusual trends (Figure A). Because of these trends, I decided that, to predict new purchases in June 2016, the models for cco_fin and reca_fin should be trained on data from different months than the other products. I therefore trained a separate model for each product, each with its own training data, rather than building just one model. (I ignored the June peak of nom_pens because the February peak was not periodic.)

Figure A.
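A trend like the one in Figure A can be tabulated by counting, for each month, the customers whose flag for a product changed from 0 to 1. The following is a minimal sketch in Python using only the standard library (the original analysis was done in R). The function name `new_purchase_counts`, the record layout, and the use of sortable month strings such as "2016-01" are assumptions made for illustration and do not come from the original solution.

```python
from collections import defaultdict

def new_purchase_counts(records, product):
    """Count customers who newly acquired `product` in each month.

    `records` is an iterable of (customer_id, month, flags) tuples, where
    `flags` maps product names to 0/1 (hypothetical layout). A new purchase
    is a 0 -> 1 change between two successive rows of the same customer.
    """
    # Group each customer's (month, flag) pairs together.
    history = defaultdict(list)
    for customer_id, month, flags in records:
        history[customer_id].append((month, flags[product]))

    counts = defaultdict(int)
    for rows in history.values():
        rows.sort()  # chronological order, assuming sortable month labels
        for (_, prev), (month, curr) in zip(rows, rows[1:]):
            if prev == 0 and curr == 1:
                counts[month] += 1
    return dict(counts)
```

Plotting these counts per product over the months is one way to spot products whose purchase pattern differs from the rest.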

What preprocessing and supervised learning methods did you use?

In this competition, extracting information from customers' past purchase history was very important. I made the features listed below:

ind_(xyz)_ult1_last: the product's index in the last month (lag-1)
ind_(xyz)_ult1_00: the number of transitions of the index from 0 to 0 up to the last month
ind_(xyz)_ult1_01: the number of transitions of the index from 0 to 1 up to the last month
ind_(xyz)_ult1_10: the number of transitions of the index from 1 to 0 up to the last month
ind_(xyz)_ult1_11: the number of transitions of the index from 1 to 1 up to the last month
ind_(xyz)_ult1_0len: the length of the run of consecutive 0 indices up to the last month
products_last: the concatenation of the last month's indices of all products
n_products_last: the number of products held in the last month
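The per-product history features in this list can be sketched as follows. This is an illustrative Python version using only the standard library, not the author's R code. The function name `history_features` is hypothetical, and reading 0len as the run of zeros that ends at the last month is an assumption about the feature's definition.

```python
def history_features(flags):
    """Summarize one customer's 0/1 history for a single product.

    `flags` lists the monthly indices in chronological order, ending with
    the month just before the one being predicted.
    """
    if not flags:
        raise ValueError("history must contain at least one month")

    # Count each kind of month-to-month transition: 00, 01, 10, 11.
    counts = {"00": 0, "01": 0, "10": 0, "11": 0}
    for prev, curr in zip(flags, flags[1:]):
        counts[f"{prev}{curr}"] += 1

    # Length of the run of zeros ending at the last month
    # (assumed meaning of the 0len feature).
    zero_run = 0
    for value in reversed(flags):
        if value != 0:
            break
        zero_run += 1

    features = {"last": flags[-1], "0len": zero_run}
    features.update(counts)
    return features
```

For example, `history_features([0, 0, 1, 1, 0, 0])` gives last = 0, 0len = 2, two 0-to-0 transitions, and one each of the 0-to-1, 1-to-1, and 1-to-0 transitions.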

Some of these are shown in the figures below. The feature products_last is not numeric, so XGBoost cannot handle it directly. I replaced each of its values with the mean of the target variable for that value (the height of each bar in Figure C).

Figure B.

Figure C.
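Replacing a categorical value such as products_last with the mean of the target can be sketched as below. This is a minimal Python illustration using only the standard library. The function names are hypothetical, and falling back to the overall target mean for categories not seen in training is an assumption, since the interview does not say how such cases were handled.

```python
from collections import defaultdict

def fit_mean_encoding(categories, targets):
    """Return ({category: mean target}, overall mean) from training rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    mapping = {c: sums[c] / counts[c] for c in sums}
    overall_mean = sum(sums.values()) / sum(counts.values())
    return mapping, overall_mean

def apply_mean_encoding(categories, mapping, default):
    """Replace each category with its encoded value, or `default` if unseen."""
    return [mapping.get(c, default) for c in categories]
```

Here each category would be a string such as "0010" built from the last month's product indices, and the target would be the 0/1 label of the product being modeled.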

The overview of training and ensembling is illustrated in the figure below. XGBoost was the only training method I used, and the models for each product were trained separately as binary classification tasks. To ensemble predictions from models trained on different data, I first normalized them so that the probabilities of the 18 products summed to 1. After the normalization, the multiple predictions for each product were log-averaged. Then the probabilities of all products were merged and the top 7 products were selected to make a submission.

Figure D.
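The normalize, log-average, and top-7 steps described above can be sketched for a single customer as follows. This is an illustrative Python version using only the standard library, not the author's R code. It assumes every prediction set covers the same products, it treats log-averaging as a geometric mean, and the function names and the `eps` floor against log(0) are hypothetical additions.

```python
import math

def normalize(probs):
    """Scale one model set's product probabilities so they sum to 1."""
    total = sum(probs.values())
    return {product: value / total for product, value in probs.items()}

def log_average(predictions, eps=1e-15):
    """Average predictions in log space (geometric mean) per product."""
    n = len(predictions)
    return {
        product: math.exp(
            sum(math.log(max(pred[product], eps)) for pred in predictions) / n
        )
        for product in predictions[0]
    }

def ensemble(predictions, k=7):
    """Normalize each prediction set, log-average them, return the top k."""
    averaged = log_average([normalize(pred) for pred in predictions])
    return sorted(averaged, key=averaged.get, reverse=True)[:k]
```

Each element of `predictions` would be a dictionary from product name to predicted probability, one per training data set, and the returned list is ordered from most to least likely.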

Which tools did you use?

I used the R language, including the packages data.table, dplyr, and xgboost. I would like to master Python too in the future.

What was the run time for both training and prediction of your winning solution?

There were 128 training runs in total (18 products * 7 times + 2 products * 1 time). Each training run took about 10 minutes, so the total estimated training time is about 1280 minutes, or roughly 21 hours. Each prediction run took 1 minute or less, so the total prediction time is about 2 hours.

Words of wisdom

What have you taken away from this competition?

I realized the importance of feature engineering through this competition. I think one of the turning points of the game was how much information we could extract from the data, rather than training methods or parameter tuning. I believe it is worth spending a lot of time on.

Do you have any advice for those just getting started in data science?

Let’s Kaggle together!

Bio

Ryuji Sakata works for Panasonic Group as a data scientist. He has been involved in data science for about 4 years. He holds a master's degree in Aeronautics and Astronautics from Kyoto University.
