From May to July 2016, over one thousand Kagglers competed in Facebook's fifth recruitment competition: Predicting Check-Ins. In this challenge, Kagglers were required to predict the most probable check-in locations in an artificial world of time and space. As the first place winner, Tom Van de Wiele, notes in this winner's interview, the uniquely designed test dataset contained about one trillion place-observation combinations, posing a significant computational challenge to competitors. Tom describes how he quickly rocketed from his first Getting Started competition on Kaggle to first place in Facebook V through his remarkable insight into a dataset consisting only of x-y coordinates, time, and accuracy, using k-nearest neighbors and XGBoost.
The basics
What was your background prior to entering this challenge?
I completed two Master's programs at two different Belgian universities (Leuven and Ghent): one in Computer Science (2010) and one in Statistics (2016). I graduated from the Statistics program during the Kaggle competition and had been combining it with a full-time job at Eastman's manufacturing plant in Ghent over the past couple of years. Initially I started as an automation engineer, and in a next phase I mostly worked on process improvements using the Six Sigma methodology. At the beginning of 2015 I got to the really good stuff when I started working with the data science group at Eastman, where I am currently employed as an analytics consultant. We solve various complex analysis problems with an amazing team and mostly rely on R, which we often combine with Shiny to develop interactive web applications.
Do you have any prior experience or domain knowledge that helped you succeed in this competition?
I have always had a passion for modeling complex problems and think that this mindset, more than anything else, helped me do well. The problem setting is very tangible and all four predictors can be interpreted by anyone, which made it a very accessible contest where mobile data domain knowledge doesn't really help. The problem can be translated to a classification setting, with the only major complication being the large number of classes (>100K). I did, however, read a lot about other winning solutions prior to the contest. The "Learning from the best" post on this blog has been especially useful.
How did you get started competing on Kaggle?
I was a passive user for a long time before entering my first competition, a Kaggle 'Getting Started' competition. Like many others I wanted to compete one day but never really took the step to my first submission. Things changed when a colleague with a chemical engineering background wanted to get into machine learning and participated in the Kobe Bryant shot selection competition. He asked some great questions and I tried to point him in the right direction, but his questions got me excited enough to download the data and implement my suggestions. Two evenings later I was close to the top 10 of the leaderboard, which had about 500 participants at the time, and I started to dream about future competitions. That second evening was the launch date of the Facebook V competition, and I wouldn't have to dream for long!
What made you decide to enter this competition?
The promise of a possible interview at Facebook was a strong motivation to participate although I considered it to be highly unlikely given that it was my first featured Kaggle competition and I already had a fully booked agenda. My second main motivation was the promise of learning new techniques and insights from other contestants.
In hindsight I am very happy to be interviewing at one of the best companies in the world for machine learning professionals but I am even more grateful for everything I learned from my own struggles and the other participants. A competition setting makes you think outside of the box and continuously challenge your approach. The tremendous code sharing on the forums was a great catalyst in this process.
Let's get technical
Extended details of the technical approach can be found on my blog. The R code is available on my GitHub account along with high level instructions to construct the final submission.
What was your general strategy?
The main difficulty of this problem is the large number of classes (places). With 8.6 million test records there are about a trillion (10^12) place-observation combinations. Luckily, most of the classes have a very low conditional probability given the data (x, y, time and accuracy). The main strategy on the forum to reduce the complexity consisted of fitting a separate classifier for each of many rectangular x-y grid cells. It makes sense to use the spatial information, since it shows the most obvious and strongest pattern across the different places. This approach makes the complexity manageable, but it is likely to lose a significant amount of information since the data is so variable. I decided to model the problem with a single stacked two-level binary classification model in order to avoid ending up with many high-variance models. The lack of any major spatial patterns in the exploratory analysis supports this approach.
Generating a single classifier for all place-observation combinations would be impractical even with a powerful cluster. My approach consists of a stepwise strategy in which the place probability (the target class) conditional on the data is only modeled for a set of place candidates. A simplification of the overall strategy is shown below:
The given raw train data is split into two chronological parts, with a ratio similar to the ratio between the train and test data. The summary period contains all given train observations of the first 408 days (minutes 0-587158). The second part of the given train data contains the next 138 days and will be referred to as the train/validation data from now on. The test data spans 153 days.
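As a rough sketch of this split, assuming the raw train data has been read into a data.table with a time column in minutes (column names follow the competition's data description; the cutoff comes from the numbers above):

```r
library(data.table)

# Raw train data: row_id, x, y, accuracy, time (minutes), place_id
train <- fread("train.csv")

summary_cutoff <- 587158                       # end of the 408-day summary period

summary_data <- train[time <= summary_cutoff]  # used to build features
train_valid  <- train[time >  summary_cutoff]  # next 138 days: train/validation labels
```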
The summary period is used to generate train and validation features and the given train data is used to generate the same features for the test data.
The three raw data groups (train, validation and test) are first sampled down into batches that are as large as possible while still fitting in the available memory for modeling. I ended up using batches of approximately 30,000 observations on a 48GB workstation. The sampling process is fully random and results in train/validation batches that span the entire 138-day train range.
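A minimal sketch of the fully random batching, continuing from the assumed train_valid table above (the ~30,000 batch size comes from the text):

```r
library(data.table)

batch_size <- 30000
set.seed(14)                                   # arbitrary seed, for reproducibility only

n_batches <- ceiling(nrow(train_valid) / batch_size)

# A random permutation of batch ids so every batch spans the whole 138-day range
train_valid[, batch := sample(rep(seq_len(n_batches), length.out = .N))]

batch_1 <- train_valid[batch == 1]             # one modeling batch of ~30,000 observations
```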
Next, in the second candidate selection step, the number of candidates per observation is reduced from 100 to 20 using 15 XGBoost models built on 430 numeric features. The conditional probability P(place_match|features) is modeled for all ~30,000*100 place-observation combinations, and the mean predicted probability of the 15 models is used to select the top 20 candidates for each observation. These models use features that combine place and observation measures from the summary period.
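The sketch below illustrates this averaging step under some assumptions: candidates is a long-format data.table with one row per (row_id, place_id) candidate pair and the 430 numeric feature columns, and models is a list of 15 fitted xgb.Booster objects, each trained on a different batch with a binary place_match target.

```r
library(data.table)
library(xgboost)

feature_cols <- setdiff(names(candidates), c("row_id", "place_id", "place_match"))
X <- as.matrix(candidates[, ..feature_cols])

# Mean of P(place_match | features) over the 15 candidate selection models
preds <- sapply(models, function(m) predict(m, X))
candidates[, mean_prob := rowMeans(preds)]

# Keep the 20 most probable candidate places for every observation
top20 <- candidates[order(-mean_prob), head(.SD, 20), by = row_id]
```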
The same features are used to generate the first-level learners. Each of the 100 first-level learners is again an XGBoost model, built using ~30,000*20 feature-place_match pairs. The predicted probabilities P(place_match|features) are used as features of the second-level learners, along with 21 manually selected features. The candidates are ordered using the mean predicted probabilities of the 30 second-level XGBoost learners.
All models are built using different train batches. Local validation is used to tune the model hyperparameters.
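A compressed sketch of the two-level stack, continuing the assumptions above: level1_models and level2_models are lists of fitted boosters, top20 holds the 20 candidates per observation, manual_cols names the hand-picked features, and the hyperparameters are placeholders, not the winning settings.

```r
library(data.table)
library(xgboost)

# Example of fitting one first-level learner on a single ~30,000 * 20 batch
fit_level1 <- function(batch, feature_cols) {
  dtrain <- xgb.DMatrix(as.matrix(batch[, ..feature_cols]), label = batch$place_match)
  xgb.train(params = list(objective = "binary:logistic", eta = 0.1, max_depth = 8),
            data = dtrain, nrounds = 100)
}

# First-level probabilities become second-level features
level1_preds <- sapply(level1_models,
                       function(m) predict(m, as.matrix(top20[, ..feature_cols])))
X2 <- cbind(level1_preds, as.matrix(top20[, ..manual_cols]))

# The mean probability of the 30 second-level learners orders the candidates
top20[, final_prob := rowMeans(sapply(level2_models, function(m) predict(m, X2)))]
setorder(top20, row_id, -final_prob)
```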
What was your most important insight into the data?
I think I had a good insight into several of the accuracy-related patterns. The accuracy distribution seems to be a mixture distribution with three peaks which changes over time; it is likely related to three different mobile connection types (GPS, Wi-Fi or cellular). The places show different accuracy patterns, and features were added to indicate the relative accuracy group densities. The middle accuracy group was set to the 45-84 range. I added relative place densities for 3 and 32 approximately equally sized accuracy bins.
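A small illustration of the three-group accuracy features, assuming the summary_data table from the earlier sketches (the 45-84 middle range comes from the text; the rest is assumed):

```r
library(data.table)

# Assign each check-in to one of the three accuracy groups
summary_data[, acc_group := cut(accuracy, breaks = c(-Inf, 44, 84, Inf),
                                labels = c("low", "mid", "high"))]

# Relative density of each accuracy group within each place, used as candidate features
acc_density <- summary_data[, .N, by = .(place_id, acc_group)]
acc_density[, rel_density := N / sum(N), by = place_id]
```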
It was also discovered that, for many places, the location is related to the three accuracy groups. This pattern was captured by adding separate features for the different accuracy groups.
Studying the places with the highest daily counts also pointed me towards obvious yearly patterns, which were translated into valuable features. The green line in the image below goes back 52 weeks from the highest daily count.
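One hedged way to encode such a yearly pattern, reusing the assumed summary_data table (column names and the exact definition are illustrative, not the actual feature):

```r
library(data.table)

summary_data[, day := floor(time / (60 * 24))]            # minutes -> days
daily <- summary_data[, .(count = .N), by = .(place_id, day)]

# Look up the daily count exactly 52 weeks (364 days) earlier for the same place
lookup <- copy(daily)[, day := day + 364]
setnames(lookup, "count", "count_52w_ago")
daily <- merge(daily, lookup, by = c("place_id", "day"), all.x = TRUE)
```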
Were you surprised by any of your findings?
The strength of K nearest neighbors was remarkable in this problem. Nearest neighbor features make up a large share of my solution and the leading public script relied on the K nearest neighbor classifier. I was also surprised that I couldn’t find clear spatial patterns in the data (e.g. a party district).
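As an illustration of the kind of nearest neighbor feature involved (a sketch only; K, the y-scaling and the FNN package are assumptions, not the winning implementation):

```r
library(data.table)
library(FNN)

k       <- 100
y_scale <- 2.5          # y is weighted more heavily than x in the distance (assumed ratio)

# K nearest summary-period check-ins for every train/validation observation
nn <- get.knnx(data  = summary_data[, .(x, y * y_scale)],
               query = train_valid[, .(x, y * y_scale)],
               k     = k)

# Place ids of the K neighbors; the share of neighbors belonging to a candidate
# place is a natural feature for that (observation, place) pair
neighbor_places <- matrix(summary_data$place_id[nn$nn.index], nrow = nrow(train_valid))
knn_share <- function(i, place) mean(neighbor_places[i, ] == place)
```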
Which tools did you use?
All code was implemented in R, and I created an Rcpp package to address the major bottleneck using C++. By far the most important package I used was data.table. I was not familiar with its syntax heading into the competition, but going through the trouble of learning it enabled me to handle the dimensions of the problem. Other critical tools were the xgboost and doParallel packages. The exploratory data analysis led to a Shiny application which was shared with the other participants.
How did you spend your time on this competition?
I was forced to spend the first 10 days of the competition thinking about possible high level approaches due to other priorities and ended up with an approach that strongly resembled my final framework. The next 10 days were used to generate about 50 features and build the framework except for the second level learners. This intermediate result got me to the first spot on the public leaderboard and encouraged me to expand the feature set. I spent most of the remaining time on detailed feature engineering and started building the second tier of the binary classifier three weeks before the end of the contest. The last two weeks were mostly dedicated to hyperparameter optimization.
What was the run time for both training and prediction of your winning solution?
Running all steps to train the model and generate the final submission would take about a month on my 48GB workstation. That seems like a ridiculously long time but it is explained by the extended computation time of the nearest neighbor features. While calculating the NN features I was continuously working on other parts of the workflow so speeding the NN logic up would not have resulted in a better final score.
Words of wisdom
What have you taken away from this competition?
I learned a lot from the technical issues I ran into but have learned most from the discussions on the forum. It is great to learn from brilliant people like Markus. The way he used semi-parametric learning to learn from the future was an eye-opener. Many others made significant contributions but it was especially useful to learn from Larry Freeman and Ben Hamner that we are better when we work together. An ensemble of top solutions can do much better than my winning submission!
Do you have any advice for those just getting started in data science?
I would suggest starting with a study of various data science topics; Andrew Ng's course is an excellent place to start. Getting your hands dirty, with appropriate feedback, is the next step if you want to get better, and Kaggle is of course an excellent platform to do so. I am very impressed with the quality and general atmosphere of the forum and would encourage everyone to start competing!
Bio
Tom Van de Wiele recently completed his Master of Statistical Data Analysis at Ghent University. Tom has a background in computer science engineering and works in the data science group at Eastman as an Analytics Consultant, where he tackles various complex data challenges. His current interests lie in applied machine learning and statistics.