Facebook ran its fifth recruitment competition on Kaggle, Predicting Check Ins, from May to July 2016. This uniquely designed competition invited Kagglers to enter an artificial world made up of over 100,000 places located in a 10 km by 10 km square. For the coordinates of each fabricated mobile check-in, competitors were required to predict a ranked list of the most probable locations. In this interview, second-place winner Markus Kliegl discusses his approach to the problem and how he relied on semi-supervised methods to learn check-in locations' variable popularity over time.
The basics
What was your background prior to entering this challenge?
I recently completed a PhD in mathematical fluid dynamics. Through various courses, internships, and contract work, I had some background in scientific computing, inverse problems, and machine learning.
Let's get technical
What preprocessing and supervised learning methods did you use?
The overall approach was to use Bayes' theorem: Given a particular data point (x, y, accuracy, time), I would try to compute, for a suitably narrowed set of candidate places, the probability

P(place | x, y, accuracy, time) ∝ P(x, y, accuracy, time | place) · P(place)

and rank the places accordingly. À la Naive Bayes, I further approximated P(x, y, accuracy, time | place) as

P(x, y | place) · P(accuracy | place) · P(time of day | place) · P(day of week | place).
I decided on this decomposition after a mixture of exploratory analysis and simply trying out different assumptions on the independence of variables on a validation set.
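As a rough sketch (not the competition code), scoring one check-in under this decomposition could look like the following. The log-probability lookup tables here are hypothetical stand-ins for the learned conditional distributions, and the minute-based time encoding is an assumption about the raw time column:

```python
# Illustrative sketch: score candidate places for one check-in under the
# Naive Bayes decomposition above and return a short ranked list.
# log_p_xy, log_p_accuracy, log_p_tod, log_p_dow, log_prior are hypothetical
# per-place lookups standing in for the learned conditional distributions.

def score_candidates(x, y, accuracy, time_min, candidates,
                     log_p_xy, log_p_accuracy, log_p_tod, log_p_dow, log_prior):
    tod = time_min % 1440              # minute of day, assuming minute timestamps
    dow = (time_min // 1440) % 7       # day of week as an integer 0..6
    scores = []
    for place in candidates:
        s = (log_p_xy[place](x, y)
             + log_p_accuracy[place](accuracy)
             + log_p_tod[place](tod)
             + log_p_dow[place](dow)
             + log_prior[place])       # log P(place); later made time-dependent
        scores.append((s, place))
    scores.sort(reverse=True)
    return [place for _, place in scores[:3]]   # top-ranked guesses for submission
```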
One challenge given the data size was to efficiently learn the various conditional distributions on the right-hand side. Inspired by the effectiveness of ZFTurbo's "Mad Scripts Battle" kernel early in the competition, I decided to start by just learning these distributions using histograms.
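For illustration, here is a minimal version of histogram learning for one of these factors, P(time of day | place). The column names follow the competition data, while the bin width and the additive smoothing constant are arbitrary choices, not the tuned values:

```python
import pandas as pd

# Minimal sketch of learning P(time of day | place) as a per-place histogram.
# The 'time' column is assumed to be in minutes; n_bins and alpha are
# illustrative choices.

def time_of_day_histograms(train: pd.DataFrame, n_bins: int = 96, alpha: float = 1.0):
    tod_bin = ((train['time'] % 1440) * n_bins // 1440).astype(int)   # 15-minute bins
    counts = (train.assign(tod_bin=tod_bin)
                   .groupby(['place_id', 'tod_bin']).size()
                   .unstack(fill_value=0)
                   .reindex(columns=range(n_bins), fill_value=0))
    probs = (counts + alpha).div((counts + alpha).sum(axis=1), axis=0)
    return probs   # one row per place_id; take logs at scoring time
```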
To make the histograms more accurate, I made them periodic for time of day and day of week and added smoothing using various filters (triangular, Gaussian, exponential). I also switched to C++ to further speed things up. (Early in the competition this got me to the top of the leaderboard with a total runtime of around 40 minutes single-threaded, while others were already at 15-50 hours. Unfortunately, I could not keep things this fast for very long.)
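A sketch of the periodic-smoothing idea, here with a Gaussian filter that wraps around the ends of each time-of-day histogram (only one of the filter shapes mentioned above, with an arbitrary width):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Circular smoothing for periodic histograms: 23:59 and 00:00 are neighbors,
# so the filter wraps around the row ends (mode='wrap'). The sigma value is
# an illustrative assumption.

def smooth_periodic(hist_probs: np.ndarray, sigma_bins: float = 2.0) -> np.ndarray:
    smoothed = gaussian_filter1d(hist_probs, sigma=sigma_bins, axis=-1, mode='wrap')
    return smoothed / smoothed.sum(axis=-1, keepdims=True)   # renormalize each row
```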
For later submissions, I averaged the P(x, y | place) histograms with Gaussian Mixture Models.
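A hedged sketch of what that might look like with scikit-learn; the number of mixture components, the regularization, the blend weight, and the hist_density callable are illustrative assumptions rather than the settings actually used:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative sketch: fit a small Gaussian Mixture Model to a place's (x, y)
# check-ins and blend its density with a histogram-based estimate.

def fit_place_gmm(xy: np.ndarray, n_components: int = 3) -> GaussianMixture:
    n_components = min(n_components, len(xy))     # guard places with few check-ins
    return GaussianMixture(n_components=n_components,
                           covariance_type='full', reg_covar=1e-6).fit(xy)

def blended_xy_density(gmm: GaussianMixture, hist_density, xy: np.ndarray,
                       w: float = 0.5) -> np.ndarray:
    gmm_density = np.exp(gmm.score_samples(xy))   # per-point density from the GMM
    return w * gmm_density + (1.0 - w) * hist_density(xy)   # average the two estimates
```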
What was your most important insight into the data?
The relative popularity of places, P(place), varied substantially over time (really it should be written as P(place, time)), and it seemed hard to me to forecast it from the training data (though others, like Jack (Japan) in third place, had some success doing this). Since the quality of the predictions was already fairly high even with a rough guess for P(place), however, I realized a semi-supervised approach might stand a good chance of learning P(place, time). My final solution performed 20 semi-supervised iterations on the test data.
Getting this to actually work well took some effort. There is more discussion in this thread.
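As a very rough illustration of the iterative idea (not the actual implementation, which is more involved), one can alternate between scoring the test check-ins with the current time-dependent prior and re-estimating that prior from the resulting soft assignments:

```python
import numpy as np

# EM-like sketch of the semi-supervised loop described above. For readability it
# keeps a dense (n_test, n_places) likelihood matrix, which would be infeasible
# at full scale without the candidate narrowing used in the real solution; all
# names and the flat initial prior are illustrative assumptions.

def semi_supervised_priors(likelihoods, time_bins, n_places, n_time_bins,
                           n_iter=20, alpha=1.0):
    """likelihoods: (n_test, n_places) values of P(x, y, accuracy, time | place).
    time_bins:   (n_test,) integer time bin of each test check-in."""
    prior = np.full((n_time_bins, n_places), 1.0 / n_places)   # rough initial guess
    for _ in range(n_iter):
        # "E-step": posterior over places for every test check-in
        post = likelihoods * prior[time_bins]
        post /= post.sum(axis=1, keepdims=True)
        # "M-step": re-estimate the time-dependent popularity from soft counts
        counts = np.zeros((n_time_bins, n_places))
        np.add.at(counts, time_bins, post)
        prior = (counts + alpha) / (counts + alpha).sum(axis=1, keepdims=True)
    return prior   # estimate of P(place, time) used to re-rank predictions
```

Iterating this way lets the test predictions themselves reveal which places are popular at which times, which is the semi-supervised effect described above.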
Were you surprised by any of your findings?
Accuracy was quite mysterious at first. I initially focused on analyzing the relationship between accuracy and the uncertainty in the x coordinate and tried to incorporate that into my model. However, this helped only a tiny bit. I eventually came to the conclusion that accuracy is most gainfully employed directly by adding a factor P(accuracy | place): different places attract different mixes of accuracies. As suggested in the forums, this makes sense if one thinks of accuracy as a proxy for device type.
Another surprise was this: On the last day, I tried ensembling different initial guesses for P(place), but this improved the score only by 0.00001 over the best initial guess, which in turn was only 0.00015 better than the worst initial guess. Though I was disappointed to not be able to improve my score in this way (rushed experiments on a small validation set had looked a little more promising), this insensitivity to the initial guess is actually a good property of the solution. It speaks to the stability of convergence of the algorithm.
Which tools did you use?
I used Python with the usual stack (pandas, matplotlib, seaborn, numpy, scipy, scikit-learn) for data exploration and for learning Gaussian Mixture Models for the P(x, y | place) distributions. The main model is written in C++. Finally, I used some bash scripts and the GNU parallel utility to automate parallel runs on slices of the data.
How did you spend your time on this competition?
I spent a little time early on exploring the data, in particular doing case studies of individual places. After that, I spent almost all my time on implementing, optimizing, and tuning my custom algorithm.
What was the run time for both training and prediction of your winning solution?
Aside from the one-time learning of Gaussian Mixture Models (which probably took around 40 hours), the run time was around 60 CPU hours. Since the problem parallelizes well, the non-GMM run time was about 15 hours on my laptop. For the last few days of the competition, I borrowed compute time on an 8-core workstation, where the run time ended up at around 4-5 hours.
In this GitHub repository, I also posted a simplified single-pass version that would have gotten me to 6th place and that runs in around 90 minutes single-threaded on my laptop (excluding the one-time GMM training time). Compared to my full solution, this semi-supervised online learning version also has the nicer property of never using any information from the future.
Bio
Markus Kliegl recently completed a PhD in Applied and Computational Mathematics at Princeton University. His current interests lie in machine learning research and applications.