The Grupo Bimbo Inventory Demand competition ran on Kaggle from June through August 2016. Over 2000 players on nearly as many teams competed to accurately forecast sales of Grupo Bimbo's delicious bakery goods. Kaggler Alex Ryzhkov came in second place with his teammates Clustifier and Andrey Kiryasov. In this interview, Alex describes how he and his team spent 95% of their time feature engineering their way to the top of the leaderboard. Read how the team used pseudo-labeling, typically used in deep learning, to improve their final forecast.
The basics
What was your background prior to entering this challenge?
I graduated from the Mathematical Methods of Forecasting department at Moscow State University in 2015. My scientific advisor was Alexander D’yakonov, who was once the top-ranked Kaggler worldwide, and I learned a lot of tips and tricks from him.
Do you have any prior experience or domain knowledge that helped you succeed in this competition?
Of course. I participated in the first rotation of the PZAD course held by Alexander D'yakonov, where we developed our practical skills in machine learning competitions. Moreover, after each competition I spent several days reading the winning solutions and figuring out what I could have done better.
How did you get started competing on Kaggle?
I joined Kaggle almost at the beginning of my studies in the Mathematical Methods of Forecasting department at university, and I totally loved it.
What made you decide to enter this competition?
This competition appealed to me in two ways. My passion is working with time-series data, and I have written several qualification works on this type of data. The second reason is that I wanted to see how far I could go using the power of Amazon AWS servers.
Let's get technical
What preprocessing and supervised learning methods did you use?
For this competition we used several XGBoost, FTRL, and FFM models, and the initial dataset was greatly expanded with the following (a rough code sketch follows the list):
- different aggregations (mean, median, max, min, etc.) of the target and sales variables by week, product, client, and town IDs;
- a New_Client_ID feature (for example, all OXXO shops share a single ID instead of the different IDs they have in Bimbo's original dataset);
- features extracted from product names, such as weight, brand, number of pieces, and weight per piece;
- Truncated SVD on the TF-IDF matrix of client and product names;
- and so on.
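
A minimal sketch of this kind of feature engineering with pandas and scikit-learn. The column names (`Semana`, `Producto_ID`, `Cliente_ID`, `Demanda_uni_equil`, `NombreProducto`) follow the competition files, but the specific aggregations, regexes, and SVD size here are illustrative assumptions, not the team's actual pipeline; in practice, aggregates of the target should be computed only from weeks earlier than the row being described, to avoid leakage.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Weekly sales and product tables, using the competition's column names
train = pd.read_csv("train.csv")
products = pd.read_csv("producto_tabla.csv")

# 1) Aggregations of the target by (product, client): mean, median, max, min
agg = (train.groupby(["Producto_ID", "Cliente_ID"])["Demanda_uni_equil"]
            .agg(["mean", "median", "max", "min"])
            .add_prefix("pc_demand_")
            .reset_index())
train = train.merge(agg, on=["Producto_ID", "Cliente_ID"], how="left")

# 2) Features parsed from product names (weight, pieces, weight per piece);
#    the regexes are illustrative, not exhaustive
products["weight_g"] = products["NombreProducto"].str.extract(r"(\d+)\s*g", expand=False).astype(float)
products["pieces"] = products["NombreProducto"].str.extract(r"(\d+)\s*p\b", expand=False).astype(float)
products["piece_weight_g"] = products["weight_g"] / products["pieces"]

# 3) Truncated SVD on the TF-IDF matrix of product names
tfidf = TfidfVectorizer()
svd = TruncatedSVD(n_components=10, random_state=42)
name_svd = svd.fit_transform(tfidf.fit_transform(products["NombreProducto"]))
for i in range(name_svd.shape[1]):
    products[f"name_svd_{i}"] = name_svd[:, i]

train = train.merge(products.drop(columns=["NombreProducto"]), on="Producto_ID", how="left")
```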
What was your most important insight into the data?
Since the public/private test split was done by time (one week in the public set and the next week in the private set), we couldn't use features with a lag of 1 when training our models. We ran experiments to check this point: models that used lag_1 features scored 0.03-0.05 worse (in log-loss terms) on the private leaderboard than models without them.
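
To make this concrete, here is a small sketch of building week-based lag features that deliberately skip lag 1, mirroring the one-week gap between the training weeks and the private test week. The function and feature names are illustrative assumptions, not the team's exact code.

```python
import pandas as pd

def add_lag_features(df, target="Demanda_uni_equil", lags=(2, 3, 4, 5)):
    """Add mean demand from `lag` weeks ago per (client, product).

    Lag 1 is deliberately skipped: the private test week sits one week
    after the public week, so a lag-1 feature that exists for training
    weeks is not available when predicting the private week.
    """
    keys = ["Cliente_ID", "Producto_ID"]
    weekly = df.groupby(keys + ["Semana"], as_index=False)[target].mean()
    for lag in lags:
        shifted = weekly.copy()
        shifted["Semana"] += lag  # a value from week w now describes week w + lag
        shifted = shifted.rename(columns={target: f"demand_lag_{lag}"})
        df = df.merge(shifted, on=keys + ["Semana"], how="left")
    return df
```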
Were you surprised by any of your findings?
It was surprising that the original client IDs worked as well as their clustered version. At the beginning of the competition I thought the original IDs had too much diversity, but in the final model we kept both of them in the dataset.
Which tools did you use?
For this competition we used the XGBoost packages in Python and R, as well as a Python implementation of the FTRL algorithm and the FFM library for regression problems. To run heavy models on the whole dataset, spot Amazon r3.8xlarge instances were the best option: fast and with a huge amount of RAM.
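
As an illustration of the modelling side, here is a minimal XGBoost regression setup in Python. Training on log1p of the demand target was a common choice in this competition, but the parameter values and the helper name below are placeholders, not the team's actual configuration.

```python
import numpy as np
import xgboost as xgb

# X_train, X_valid: feature matrices from the engineering steps above;
# y_train, y_valid: demand targets. All hyperparameters are placeholders.
def train_xgb(X_train, y_train, X_valid, y_valid):
    dtrain = xgb.DMatrix(X_train, label=np.log1p(y_train))
    dvalid = xgb.DMatrix(X_valid, label=np.log1p(y_valid))
    params = {
        "objective": "reg:squarederror",  # squared error on log1p(demand)
        "eta": 0.05,
        "max_depth": 10,
        "subsample": 0.8,
        "colsample_bytree": 0.7,
        "eval_metric": "rmse",
    }
    model = xgb.train(params, dtrain, num_boost_round=2000,
                      evals=[(dvalid, "valid")], early_stopping_rounds=50)
    return model

# Predictions are transformed back to the original scale with expm1:
# preds = np.expm1(model.predict(xgb.DMatrix(X_test)))
```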
How did you spend your time on this competition?
From my point of view, it was a feature engineering competition. After my first script with XGBoost, I spent all of my time preprocessing the client and product tables, working with towns and states, and creating new aggregations of the sales and target variables. So it was 95% of the time on feature engineering and only 5% on machine learning.
What was the run time for both training and prediction of your winning solution?
If we run it on an r3.8xlarge instance, it takes around 146 hours (about 6 days), including the feature engineering, training, and prediction steps.
Words of wisdom
What have you taken away from this competition?
It was really surprising that pseudo-labeling techniques can work outside deep learning competitions. You should also spend a lot of time thinking about your validation and prediction techniques; this can prevent you from losing your position at the end.
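
For readers unfamiliar with pseudo-labeling, here is a minimal sketch of the general idea: predict on the test set, treat (a sample of) those predictions as labels, and retrain on the enlarged dataset. This illustrates the technique in generic scikit-learn-style terms; the function name, sampling fraction, and single-round scheme are assumptions, not the exact procedure the team used.

```python
import pandas as pd

def pseudo_label(model_factory, X_train, y_train, X_test, sample_frac=0.5, seed=42):
    """Generic pseudo-labeling round: train, predict on test,
    add a sample of test rows with predicted labels, then retrain."""
    model = model_factory()
    model.fit(X_train, y_train)

    # Use the model's own test predictions as (noisy) labels
    pseudo_y = pd.Series(model.predict(X_test), index=X_test.index)
    sampled = X_test.sample(frac=sample_frac, random_state=seed)

    X_aug = pd.concat([X_train, sampled])
    y_aug = pd.concat([y_train, pseudo_y.loc[sampled.index]])

    final_model = model_factory()
    final_model.fit(X_aug, y_aug)
    return final_model

# Usage with any sklearn-style regressor, e.g.:
# final = pseudo_label(lambda: SomeRegressor(), X_tr, y_tr, X_te)
```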
Do you have any advice for those just getting started in data science?
From my side, competitions with Kernels enabled are the best teachers for beginners. You can find all kinds of scripts there, from simple ones (like predicting all zeros, or a random forest on the whole initial dataset) to advanced ones (blends of several models with preprocessing and feature engineering). It's also useful to read topics on the forum; you can get a lot of ideas from other competitors' posts. The last piece of advice, but in my opinion the best one: don't give up!
Teamwork
How did your team form?
I was in the top 20 when I got stuck and realized I needed new views and ideas to finish in the top 10 on the private leaderboard; at that point I merged with Clustifier and we started working together. Later we joined with Andrey to be competitive with another top team, The Slippery Appraisals.
How did your team work together?
We had a chat in Skype (later in Google Hangouts) where we could discuss our ideas. All our data was shared on Google Drive, and we uploaded our first-level submissions there. Moreover, I shared my RStudio server on AWS with Clustifier, so we could easily work on the same files simultaneously.
How did competing on a team help you succeed?
Firstly, merging your ideas about one to two weeks before the end of a competition increases your score. Secondly, you can exchange ideas with teammates, and each of them will implement those ideas in his own manner, which boosts your models even more. Finally, it's a great way to share experience and tips and tricks, which help you move up and improve the stability of your solution before the private LB.
Just for fun
If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?
It would be nice to create a small challenge on predicting leaderboard shake-up. This topic is always popular on the forums near the end of each competition.
What is your dream job?
Data scientist outside Russia
Bio
Alexander Ryzhkov graduated from the Mathematical Methods of Forecasting department at Moscow State University in 2015, where his scientific advisor was Alexander D’yakonov. He now works as a software developer at the Deutsche Bank Technology Center in Moscow.