
Homesite Quote Conversion, Winners' Interview: 2nd Place, Team Frenchies | Nicolas, Florian, & Pierre


The Homesite Quote Conversion competition challenged Kagglers to predict the customers most likely to purchase a quote for home insurance based on an anonymized database of information on customer and sales activity. 1925 players on 1764 teams competed for a spot at the top and team Frenchies found themselves in the money with their special blend of 600 base models. Nicolas, Florian, and Pierre describe how the already highly separable classes challenged them to work collaboratively to eke out improvements in performance through feature engineering, effective cross validation, and ensembling.

Frenchies

The team started off with Pierre and Florian, as they are longtime friends. Nicolas asked to join later in the competition, and it was one of the best decisions of this challenge! All of us were finalists in the “Cdiscount.com” competition hosted on datascience.net, the “French Kaggle”. It was a real pleasure for us to work together as a French team and to demonstrate our skills in an international contest.

Nicolas Gaude

Nicolas on Kaggle

Working for Bouygues Telecom, a French telecom operator with 15M subscribers, I head its data science team with a focus on production efficiency and scalability. With a 10-year background in embedded software development, I moved to the big data domain 3 years ago and fell in love with machine learning. Kaggle is for me a unique opportunity to sharpen my skills and to compete with other data scientists around the world. And honestly, Kaggle is the only place where a 0.0001% improvement matters so much that you can build an ensemble of hundreds of models to get to the top, and that’s a lot of fun.

Florian Laroumagne

Florian on Kaggle

Currently working as a BI Analyst at EDF (the major French and worldwide electricity provider), I graduated from the ENSIIE, a top French maths & IT engineering school. I took some statistics and machine learning courses there, but had no opportunity to apply them in my professional life. To improve my skills, I followed some MOOCs (on “france-universite-numerique” and on “Coursera”) about statistics with R, big data and machine learning. After acquiring the theory, I wanted to put it into practice. This is how I ended up on Kaggle.

Pierre Nowak

Pierre on Kaggle

I graduated from ENSIIE & Université d’Evry Val d’Essonne with a double degree in Financial Mathematics. My interest in machine learning came with my participation in a text mining challenge hosted by datascience.net. I have been working for 7 months at EDF R&D, first on text mining problems, and recently moved to forecasting daily electricity load curves. Despite the fact that many people say Kaggle is brute force only, I find it to be the place to learn brand new algorithms and techniques. I especially had the opportunity to learn deep learning with Keras and next-level blending thanks to Nicolas and some public posts from Gilberto and the Mad Professors.

Background

None of us had prior background in Homesite's business, since none of us works in that field. However, this didn't hurt us: we think the fact that the data was anonymized brought most of the competitors to approximately the same level.

About the technologies used, there were two schools within our team: Nicolas was proficient in Python while Florian was more R-focused. Pierre was quite versatile and acted as the glue between the two worlds.

The solution

Feature engineering

We have to admit that feature engineering wasn’t very easy for us. Sure, we tried some differences between features, which could then be selected (or not) via a feature selection process, but in the end we had only the basic dataset plus a few engineered features. The ones we kept were (a short sketch of how they can be computed follows the list):

  • Count of 0, 1 and N/A row-wise
  • PCA top component features
  • TSNE 2D
  • Cluster ID generated with k-means
  • Some differences among features (especially the “golden” features found in a public script)
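For readers who want to reproduce these, here is a minimal sketch (not the team's actual code) of how such features can be computed with pandas and scikit-learn on a toy numeric frame:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.randint(0, 3, size=(500, 20)).astype(float))
X[X == 2] = np.nan                                    # pretend some values are missing

feats = pd.DataFrame(index=X.index)
feats["count_zero"] = (X == 0).sum(axis=1)            # row-wise count of 0s
feats["count_one"] = (X == 1).sum(axis=1)             # row-wise count of 1s
feats["count_na"] = X.isna().sum(axis=1)              # row-wise count of N/A

X_filled = X.fillna(-1)
pca = PCA(n_components=2).fit_transform(X_filled)                      # top PCA components
feats["pca_1"], feats["pca_2"] = pca[:, 0], pca[:, 1]
tsne = TSNE(n_components=2, random_state=0).fit_transform(X_filled)    # 2D t-SNE projection
feats["tsne_1"], feats["tsne_2"] = tsne[:, 0], tsne[:, 1]
feats["kmeans_id"] = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X_filled)
print(feats.head())
```

The cluster ID and the low-dimensional projections then sit alongside the original columns as extra inputs to the base models.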

This challenge was really fun because even at the beginning of the competition, the AUC was really high (around 97% already). As we can see, the two classes are in fact quite easily separable:

Dataset plotted against the top 2 components of PCA (negative in green, positive in red)

Dataset plotted against the top 2 dimensions of TSNE (negative in green, positive in red)

Dataset modeling

Like other teams, we encoded the categorical features. Most of the time we used a very common “label encoder”: each category was replaced with an ID. Despite its simplicity, this method works quite well for tree-based classifiers. However, it is not recommended for linear models, which is why we also generated one-hot encoded features. Finally, we also tried target encoding to capture the conversion ratio associated with each categorical level. It didn’t improve our score much but was worth having in our blend.
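A small sketch of the three encodings on a toy frame (the target column name matches the competition data; the smoothing and out-of-fold details the team used are not specified, so the target encoding below is the naive in-sample version):

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "TX", "CA", "NY", "TX", "CA"],
    "QuoteConversion_Flag": [1, 0, 1, 0, 0, 1],
})

# Label encoding: each category becomes an integer ID (works well for trees).
df["state_le"] = df["state"].astype("category").cat.codes

# One-hot encoding: better suited to linear models.
ohe = pd.get_dummies(df["state"], prefix="state")

# Target encoding: conversion rate per category. In practice this should be
# computed out-of-fold to avoid leaking the target into the features.
df["state_te"] = df["state"].map(df.groupby("state")["QuoteConversion_Flag"].mean())
print(df.join(ohe))
```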

In addition to these different versions of the dataset, we also split it: we used the full version (all features, all rows) for the majority of our classifiers, but we also trained weaker models on subsets of columns. For example, we trained one model on the “personal” columns only, another on the “geographical” columns only, and so on.

Training & Ensembling

With all these different versions of the dataset, we trained well-known and well-performing machine learning models, such as:

  • Logistic Regression
  • Regularized Greedy Forest
  • Neural Networks
  • Extra Trees
  • XGBoost
  • H2O Random Forest (just 1 or 2 models into our first stage: not really important)

Our base level consisted of around 600 models. About 100 were built by “hand” with different features and hyperparameters across all of the above technologies. Then, to add some diversity, we built a robot that created the 500 remaining models, automatically training XGBoost, Logistic Regression and Neural Network models on randomly chosen subsets of features.

All our models were built with a 5-fold stratified CV. It gave us a local way to check our improvements and to avoid overfitting the leaderboard. Furthermore, the out-of-fold predictions from this CV are what made our ensembling possible.
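A minimal sketch of how a stratified 5-fold CV yields out-of-fold predictions that double as training data for the next stage (a simple scikit-learn model stands in for the team's base learners):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
oof = np.zeros(len(y))                       # out-of-fold predictions

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]

print("out-of-fold AUC:", roc_auc_score(y, oof))
# `oof` becomes one column of the training matrix for the second-level blenders.
```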

Example of the diversity between two models, despite the fact they are highly correlated:

Predictions of an RGF model plotted against predictions of an XGB model

To blend our 600 models, we tried different approaches. After some failures, we retained 3 well-performing blenders: a classical Logistic Regression, an XGBoost and a bag of Neural Networks. These three blends naturally outperformed our best single model and captured the information in different ways. Then we transformed our predictions into ranks and simply averaged the ranks of the 3 blends to get our final submission.
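The final step can be sketched as follows; the exact tie handling and any normalization are assumptions:

```python
import numpy as np
from scipy.stats import rankdata

# Stand-in predictions from the three blenders (Logistic Regression, XGBoost,
# bagged Neural Networks).
rng = np.random.RandomState(0)
pred_lr, pred_xgb, pred_nn = rng.rand(3, 10)

ranks = np.vstack([rankdata(p) for p in (pred_lr, pred_xgb, pred_nn)])
blend = ranks.mean(axis=0)                                    # average of the ranks
blend = (blend - blend.min()) / (blend.max() - blend.min())   # optional 0-1 rescale (AUC is rank-based)
print(blend)
```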

Here is a sketch that sums up this multi-level stacking:

Model ensemble

Words of wisdom

Here is a short list of what we learnt and / or what worked for us:

  • Read forums, there are lots of useful insights
  • Use the best script as a benchmark
  • Don’t be afraid to generate a lot of models and keep all the data created this way. You can still select among them later in order to blend them… or let the blender make that decision for you 🙂
  • Validate the behavior of your CV. If you have a huge jump in local score that doesn’t reflect on the leaderboard there is something wrong
  • Grid search in order to find hyperparameters works fine
  • Blending is a powerful tool. Please read the following post if you haven't already: http://mlwave.com/kaggle-ensembling-guide/
  • Be aware of the standard deviation when blending. Increasing the metric is good but not that much if the SD increases
  • Neural nets are capricious 🙁 bag them if necessary
  • Merging is totally fine and helps each teammate learn from the others
  • If you are a team, use collaborative tools: Skype, Slack, svn, git...

Recommendations to those just starting out

As written just above, read the forums. There are nice tips, starter codes or even links to great readings. Don’t hesitate to download a dataset and test a lot of things on it. Even if most of the tested methods fail or give poor results, you will acquire some knowledge about what works and what doesn't.

Merge! Seriously, learning from others is what makes you stronger. We all have different insights, backgrounds or techniques which can be beneficial for your teammates. Last but not least, do not hesitate to discuss your ideas. This way you can find some golden thoughts that can push you to the top!


Yelp Restaurant Photo Classification, Winner's Interview: 2nd Place, Thuyen Ngo


The Yelp Restaurant Photo Classification competition challenged Kagglers to assign attribute labels to restaurants based on a collection of user-submitted photos. In this recruitment competition, 355 players tackled the unique multi-instance and multi-label problem and in this blog the 2nd place winner describes his strategy. His advice to aspiring data scientists is clear: just do it and you will improve. Read on to find out how Thuyen Ngo dodged overfitting with his solution and why it doesn't take an expert in computer vision to work with image data.

The Basics

What was your background prior to entering this challenge?

I am a PhD student in Electrical and Computer Engineering at UC Santa Barbara. I am doing research in human vision and computer vision. In a nutshell I try to understand how humans explore the scene and apply that knowledge to computer vision systems.

Thuyen (AKA Plankton) on Kaggle

How did you get started competing on Kaggle?

My labmate introduced Kaggle to me about a year ago, and I have participated in several competitions since then.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

All of my projects involved images. So it's fair to say that I have some “domain knowledge”. However, for this competition I think knowledge of image processing is not as important as machine learning. Since everyone uses similar image features (from pre-trained convolutional neural networks, which have been shown to contain good global descriptions of images and are therefore very suitable for our problem), the difficult part is choosing a learning framework that can combine information from different instances in an effective way.

What made you decide to enter this competition?

I am interested in image data in general (even though in the end there was not much image analysis involved). The problem itself is very interesting since there's no black-box solution that can be applied directly.

Let's get technical

What preprocessing and supervised learning methods did you use?

Like most participants, I used pre-trained convolutional networks to extract image features; I didn't do any other preprocessing or network fine-tuning. I started with the Inception-v3 network from Google but ended up using the pre-trained ResNet-152 provided by Facebook.

My approach is super simple. It's just a neural network.

How did you deal with the multi-instance and multi-label aspect of this problem?

I used a multilayer perceptron since it gave me the flexibility to handle both the multi-label and multi-instance aspects at the same time. For the multiple labels, I simply used 9 sigmoid units, and for the multiple instances, I employed something like the attention mechanism from the neural network literature. The idea is to let the network learn by itself how to combine information from many instances (which instance to look at).

Each business is represented by a matrix of size N x 2048, where N is the number of images for that business. The network is composed of four fully connected (FC) layers. The attention model is a normal FC layer, but its activation is a softmax over images, weighting the importance of each image for a particular feature. I experimented with many different architectures; in the end, the typical architecture for the final submission was as follows:

[Figure: architecture of the final network]

The model is trained using the business-level labels (each business is a training example) as opposed to image-level labels (like many others). I used the standard cross entropy as the loss function. The training is done with Nesterov's accelerated SGD.
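Below is a plain numpy forward-pass sketch of this attention-style pooling for a single business. The layer sizes and the exact wiring are assumptions for illustration; the real model was trained in Lasagne/Theano with dropout and regularization:

```python
import numpy as np

rng = np.random.RandomState(0)
N, D, H, L = 30, 2048, 512, 9           # images per business, feature dim, hidden units, labels
X = rng.randn(N, D)                     # one business: N image features (e.g. from ResNet-152)

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Attention: an ordinary FC layer, but its activation is a softmax taken over
# the N images, so each feature gets per-image importance weights.
W_att = rng.randn(D, D) * 0.01
att = softmax(X @ W_att, axis=0)        # (N, D): each column sums to 1 over the images

# Weighted pooling collapses the N x D matrix into one business-level vector.
pooled = (att * X).sum(axis=0)          # (D,)

# Remaining FC layers and 9 sigmoid units for the multi-label output.
W1 = rng.randn(D, H) * 0.01
W2 = rng.randn(H, L) * 0.01
hidden = np.maximum(0, pooled @ W1)                # ReLU
probs = 1.0 / (1.0 + np.exp(-(hidden @ W2)))       # 9 independent label probabilities
print(probs.round(3))
```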

What was your most important insight into the data?

Finding a reliable local validation scheme was quite challenging for me. It's a multi-label problem, so there's no standard stratified split. I tried some greedy methods to build a stratified 5-fold split, but they didn't perform very well. In the end I resorted to a random 5-fold split. My submission is normally the average of 5 models from 5-fold validation.

Another problem is that we only have 2000 businesses for training and another 2000 test cases. Even though that sounds like a lot of data, there are not that many training signals (the labels). Combined with the instability of the F-measure, this makes validation even more difficult.

Since the evaluation metric is F1 score, it is reasonable to use F-measure as the loss function, but somehow I couldn't make it work as well as the cross entropy loss.

With the limited labeled data, my approach could easily have overfitted badly (the model has more than 2M parameters). I used dropout for almost all layers and applied L2 regularization and early stopping to mitigate overfitting.

How did you spend your time on this competition?

Most of the time for machine learning (training), 1% for preprocessing I guess.

Which tools did you use?

I used Tensorflow/torch for feature extraction (with the provided code from Google/Facebook) and Lasagne (Theano) for my training.

What was the run time for both training and prediction of your winning solution?

It takes about two and a half hours to train one model and about a minute to make the predictions for all test images.

Words of wisdom

What have you taken away from this competition?

Neural networks can do any (weird) thing 🙂

Do you have any advice for those just getting started in data science?

Just do it, join Kaggle, participate and you will improve.

Bio

Thuyen Ngo is a PhD student in Electrical and Computer Engineering at the University of California, Santa Barbara.

March Machine Learning Mania 2016, Winner's Interview: 1st Place, Miguel Alomar


The annual March Machine Learning Mania competition sponsored by SAP challenged Kagglers to predict the outcomes of every possible match-up in the 2016 men's NCAA basketball tournament. Nearly 600 teams competed, but only the first place forecasts were robust enough against upsets to top this year's bracket. In this blog post, Miguel Alomar describes how calculating the offensive and defensive efficiency played into his winning strategy.

The Basics

What was your background prior to entering this challenge?

I earned a Master’s Degree in Computer Science from UIB in Mallorca, Spain. For nearly 20 years, I have been involved in software development, business intelligence and data warehousing. Recently, I have developed an interest in analytics and forecasting.

Miguel (AKA Mallorqui) on Kaggle

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

In Spain, I played amateur basketball for 10 years. I like to think that is the reason I won.

The truth is I missed most of the basketball games this season and did not have a good feel for any of the teams' quality. That most likely helped me because, if I had seen more games, my judgment might have changed some of the forecasts. Normally, I am pretty bad at picking winners.

How did you get started competing on Kaggle?

I found Kaggle through some data science lessons I was taking on Coursera.

What made you decide to enter this competition?

I really like analytics and sports so I thought it was a perfect competition for me.

But the key factor is that the moderators and other members make it easy to enter: they provide lots of help, data, advice and feedback. The data is already formatted and prepared, so the data gathering and manipulation task is very easy. Some members of the community seemed more interested in sharing and discovering new methods and insights than in winning the competition.

Let's get technical

What preprocessing and supervised learning methods did you use?

I used logistic regression and random forests. I tried AdaBoost but didn't get very good results, so I didn't use it in my final model.

What was your most important insight into the data?

The data behind this competition is very simple; the box stats from basketball games are easy to understand. The key factor for me was offensive and defensive efficiency: how do you calculate those? What weight do you give to strength of schedule? Can you "penalize" a team because they haven't played against the best teams in the nation? Can you lower their rating for something that didn't happen?

Those are the kinds of questions I was trying to answer. I developed several models with different degrees of adjustment to the efficiency ratings and checked their scores against past seasons.
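For reference, one common way to compute pace-adjusted efficiency from box-score stats looks like the sketch below; it is a standard approximation, not necessarily the exact formula used here:

```python
def possessions(fga, orb, to, fta):
    """Estimated possessions for one team in one game."""
    return fga - orb + to + 0.475 * fta

def offensive_efficiency(points, fga, orb, to, fta):
    """Points scored per 100 possessions."""
    return 100.0 * points / possessions(fga, orb, to, fta)

def defensive_efficiency(points_allowed, opp_fga, opp_orb, opp_to, opp_fta):
    """Points allowed per 100 opponent possessions."""
    return 100.0 * points_allowed / possessions(opp_fga, opp_orb, opp_to, opp_fta)

# Example: 78 points on 60 FGA, 10 offensive rebounds, 12 turnovers, 20 FTA.
print(round(offensive_efficiency(78, 60, 10, 12, 20), 1))   # ~109.1 points per 100 possessions
```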

Since my scores in Stage1 of the competition were not very good, I kept changing my model after Stage1 was closed.

My goal for next year is to formally test those different models to find out if there is any validity to my ideas.

[Chart: first-round upsets]

Were you surprised by any of your findings?

After building the submission files, I put them into brackets using a script provided by one of the Kaggle members. My first model had a more conservative look to it and my second model (the final winner) just didn’t look right to me. Teams like SF Austin, Indiana and Gonzaga were predicted to go very far in the bracket. I almost scrapped it but since it was my 2nd model I decided to go with it. This model got most of the first round upsets right, that surprised me.

NCAA bracket

Which tools did you use?

I used R, R studio and SQL.

How did you spend your time on this competition?

I would say my time allocation was 35% reading forums and blogs, 15% manipulating data, 25% building models and 25% evaluating results.

What was the run time for both training and prediction of your winning solution?

Five minutes. I trained my model using only 2016 data, so the amount of data to process is very small.

Bio

Miguel Alomar has a Master’s Degree in Computer Science from UIB in Mallorca, Spain. For nearly 20 years, he has been involved in software development, business intelligence and data warehousing.

BNP Paribas Cardif Claims Management, Winners' Interview: 1st Place, Team Dexter's Lab | Darius, Davut, & Song


The BNP Paribas Claims Management competition ran on Kaggle from February to April 2016. Just under 3000 teams made up of over 3000 Kagglers competed to predict insurance claims categories based on data collected during the claim filing process. The anonymized dataset challenged competitors to dig deeply into data understanding and feature engineering and the keen approach taken by Team Dexter's Lab claimed first place.

The basics

What was your background prior to entering this challenge?

Darius: BSc and MSc in Econometrics at Vilnius University (Lithuania). Currently work as an analyst at a local credit bureau, Creditinfo. My work mainly involves analyzing and making predictive models with business and consumer credit data for financial sector companies.

Darius (AKA raddar) on Kaggle

Davut: BSc and MSc in Computer Engineering and Electronics Engineering (Double Major), and currently a PhD student in Computer Engineering at Istanbul Technical University (Turkey). I work as a back-end service software developer at a company which provides stock exchange data to users. My work is not related to any data science subjects, but I work on Kaggle whenever I get spare time. I live in Istanbul - traffic is a headache here - I spend almost 4-5 hours a day in traffic, and during my commute I code for Kaggle 🙂

Davut on Kaggle

Song: Two masters (Geological Engineering and Applied Statistics). Currently I am working in an insurance company. My work is mainly building models - pricing models for insurance products, fraud detection, etc.

Song (AKA onthehilluu) on Kaggle

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

Davut: Song has lots of experience in the insurance field and Darius in the finance field. At first, we were stuck with mediocre results for 2-3 weeks, until Darius came up with a great idea which we then had nice discussions about. Both perspectives helped us improve our score and led us to victory.

How did you get started competing on Kaggle?

Darius: I heard of Kaggle a few years ago, but only recently started Kaggling. I was looking for challenges and for the chance to work with different types of data. Surprisingly, my data insights and feature engineering were good enough to claim prize money in my very first serious competition. Kaggle has become my favorite hobby since.

Davut: Three years ago, I took a course during my Master's degree in which a professor gave us a term project from Kaggle (Adzuna Job Salary Prediction). I did not participate then, but 6 months later I started Kaggling. The Higgs Boson Machine Learning Challenge was my first serious competition, and since then I've participated in more competitions and met great data scientists and friends.

Song: I have been studying machine learning by myself. Kaggle is an excellent site to learn by doing and learn from each other.

What made you decide to enter this competition?

Darius: I like working with anonymous data, and I thought that I had an edge over the competition, as I had discovered interesting insights in previous competitions as well. And Davut had wanted to team up in a previous competition, so we joined forces early on.

Davut: Kaggle became kind of an addiction to me like many others 🙂 After Prudential, I wanted to participate in one more.

Song: Nothing special.

Let's get technical

What preprocessing and supervised learning methods did you use?

Darius: The most important part was setting a stratified 10-fold CV scheme early on. For most of the competition, a single XGBoost was my benchmark model (in the end, the single model would have scored 4th place). In the last 2 weeks, I made a few diverse models such as rgf, lasso, elastic net, and SVM.

Davut: We built different feature sets and trained a variety of diverse models on them, such as kNN, extra trees classifiers, random forests, and neural networks. We also tried different objectives in XGBoost.

Song: In our final solution, we used XGBoost as the ensembling model; it combined 20 XGBoost models, 5 random forests, 6 randomized decision tree models, 3 regularized greedy forests, 3 logistic regression models, 5 ANN models, 3 elastic net models and 1 SVM model.

What was your most important insight into the data?

Darius: The most important insight was understanding what kind of data we were given. It is hard to make assumptions about anonymous data, but I dedicated 3 weeks of competition time to data exploration, which paid dividends.

First, as every feature in the given dataset was scaled and had some random noise introduced, I figured that identifying how to deal with noise and un-scaling the data could be important. I thought of a simple but fast method to detect the scaling factor for integer type features. It took some time, but in the end it was a crucial part of our winning solution.
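The exact method is not described, but one plausible sketch of detecting such a scaling factor is to look at the gaps between the sorted unique values of a feature: for an originally integer-valued column they cluster around multiples of the hidden factor. This is an assumed reconstruction, not the team's code:

```python
import numpy as np

def detect_scale(values, noise_tol=1e-3):
    """Return a candidate scaling factor if the feature looks like scaled integers."""
    v = np.sort(np.unique(values[~np.isnan(values)]))
    gaps = np.diff(v)
    gaps = gaps[gaps > noise_tol]            # ignore tiny gaps that are pure noise
    scale = gaps.min()                       # smallest real gap = candidate factor
    multiples = gaps / scale
    if np.all(np.abs(multiples - np.round(multiples)) < 0.05):
        return scale                         # all gaps are near-integer multiples of it
    return None

rng = np.random.RandomState(0)
hidden = rng.randint(0, 50, size=1000)                     # original integer feature
observed = hidden * 0.7212 + rng.normal(0, 1e-5, 1000)     # scaled + small noise, as observed
k = detect_scale(observed)
print(k, np.round(observed / k).astype(int)[:10], hidden[:10])   # integers recovered
```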

Second, given our assumptions about variable meanings, we built efficient feature interactions. We devised, among other ideas, a lag and lead feature based on our impression that we were dealing with panel data. In the end, our assumptions about panel data and variable meaning were not realistic (indeed it would mean that the same client could face hundreds or thousands of claims). However, our lag and lead features did bring significant value to our solution, which is certainly because it was an efficient way to encode interactions. This is consistent with the other top two teams' solutions, which also benefited from encoding some interactions between v22 and other variables with different methods aside from lag and lead. In our opinion, there is certainly very interesting business insight for the host in these features.
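A hedged sketch of a lag/lead feature in pandas, treating rows that share the same v22 value as an ordered sequence (the ordering and the companion variable are assumptions; only the column names follow the BNP dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({
    "ID": range(8),
    "v22": ["A", "B", "A", "A", "B", "C", "B", "A"],
    "v50": rng.rand(8),
}).sort_values("ID")

grouped = df.groupby("v22")["v50"]
df["v50_lag"] = grouped.shift(1)     # v50 on the previous row sharing the same v22
df["v50_lead"] = grouped.shift(-1)   # v50 on the next row sharing the same v22
print(df)
```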

Were you surprised by any of your findings?

Darius: To my surprise, our approach was not overfitting at all. Other than that, I believed in our assumptions (be they correct or not) and we figured that other teams were just doing approximations of our findings - which other top teams admitted.

How did you spend your time on this competition?

Some Kagglers start training models right away, keep doing that until the end, and focus only on ensembling. We instead focused on how to improve a single model with new features. We spent 45 days on feature engineering, then the rest of the time on model training and stacking.

What was the run time for both training and prediction of your winning solution?

Our single best model only takes less than an hour to train on an 8-12 core machine. However, the ensemble itself takes several days to finish.

Words of wisdom

What have you taken away from this competition?

Darius: I tried new XGBoost parameters that I had not used before, which proved helpful in this competition. I also created my own R wrapper for RGF, and I got noticed by top Kagglers, which I did not think would happen so soon.

Davut: The team play was amazing, and we had so much fun during the competition. We tried so many crazy ideas, most of which failed, but it was still really fun 🙂

Song: Keep learning endlessly. Taking a competition in a team is really a happy journey.

Do you have any advice for those just getting started in data science?

Darius: Be prepared to work hard as good results don't come easy. Make a list of what you want to get good at first and prioritize. Don't let XGBoost be the only tool in your toolbox.

Davut: Spend sufficient time on feature engineering, and study previous competitions' solutions, no matter how much time has passed. For example, our winning approach is very similar to Josef Feigl's winning solution in Loan Default Prediction.

Song: Keep learning.

Just for fun

If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?

Darius: As an econometrician, I love competitions which involve predicting future trends. I'd love to put ARIMA and other time series methods into action more often.

Davut: In the Higgs Boson Challenge, high-energy physicists and data scientists competed together. I liked the spirit then and remember Lubos Motl and his posts brought new aspects to the approaches. I would like to pose a multidisciplinary problem.

Song: Any problem balancing exploration and exploitation.

What is your dream job?

Darius: Making a global impact on people's lives with data science projects.

Davut: Using data science for early diagnosis for severe diseases like cancer, heart attack, etc.

Song: Data Scientist.

Acknowledgments

We want to thank this Kaggle blog post, which helped us greatly by shaking some of our prior beliefs about the data and by prompting us to brainstorm new ideas.

Home Depot Product Search Relevance, Winners' Interview: 1st Place | Alex, Andreas, & Nurlan


A total of 2,552 players on over 2,000 teams participated in the Home Depot Product Search Relevance competition which ran on Kaggle from January to April 2016. Kagglers were challenged to predict the relevance between pairs of real customer queries and products. In this interview, the first place team describes their winning approach and how computing query centroids helped their solution overcome misspelled and ambiguous search terms.

The Basics

What was your background prior to entering this challenge?

Andreas: I have a PhD in Wireless Network Optimization using statistical and machine learning techniques. I worked for 3.5 years as Senior Data Scientist at AGT International applying machine learning in different types of problems (remote sensing, data fusion, anomaly detection) and I hold an IEEE Certificate of Appreciation for winning first place in a prestigious IEEE contest. I am currently Senior Data Scientist at Zalando SE.

Alex: I have a PhD in computer science and work as data science consultant for companies in various industries. I have built models for e-commerce, smart home, smart city and manufacturing applications, but never worked on a search relevance problem.

Nurlan: I recently completed my PhD in biological sciences, where I worked mainly with image data for drug screening and performed statistical analysis for gene function characterization. I also have experience applying recommender system approaches to novel gene function prediction.

How did you get started competing on Kaggle?

Nurlan: The wide variety of competitions hosted on Kaggle motivated me to learn more about applications of machine learning across various industries.

Andreas: The opportunity to work with real-world datasets from various domains and also interact with a community of passionate and very smart people was a key driving factor. In terms of learning while having fun, it is hard to beat the Kaggle experience. Also, exactly because the problems are coming from the real world, there are always opportunities to apply what you learned in a different context, be it another dataset or a completely different application domain.

Alex: I was attracted by the variety of real-world datasets hosted on Kaggle and the opportunity to learn new skills and meet other practitioners. I was a bit hesitant to join competitions in the beginning, as I was not sure I would be able to dedicate the time, but I never regretted getting started. The leaderboard, the knowledge exchange in the forums and working in teams create a very exciting and enjoyable experience, and I was often able to transfer knowledge gained on Kaggle to customer problems in my day job.

What made you decide to enter this competition?

Alex: Before Home Depot, I participated in several competitions with anonymized datasets where feature engineering was very difficult or didn’t work at all. I like the creative aspect of feature engineering and I expected a lot of potential for feature engineering in this competition. Also I saw a chance to improve my text mining skills on a very tangible dataset.

Nurlan: I had two goals in this competition: mastering state-of-the-art methods in natural language processing and model ensembling techniques. Teaming up with experienced Kagglers and engaging with the Kaggle community through the forums gave me the opportunity to achieve both.

Andreas: Learning more about feature engineering and about ML models that do well in NLP was the first driver. The decent but not overwhelming amount of data also gave good opportunities for ensembling and trying to squeeze the most out of the models, something I enjoy doing when there are no inherent time or other business constraints (as is often the case in commercial data science applications).

Let’s get technical

What preprocessing and supervised learning methods did you use?

Figure 1: Overview of our prediction pipeline - most important features and models highlighted in orange.

Preprocessing and Feature Engineering

Our preprocessing and feature engineering approach can be grouped into five categories: keyword match, semantic match, entity recognition, vocabulary expansion and aggregate features.

Keyword Match

In keyword match, we counted the number of matching terms between the search term and different sections of the product information, and also stored the matching term positions. To overcome misspellings, we used fuzzy matching, where we counted character n-gram matches instead of complete terms. We also computed TF-IDF-normalized scores of the matching terms to discount non-specific term matches.
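A compact sketch of these two ideas, with illustrative function names (the TF-IDF weighting is omitted for brevity):

```python
def word_matches(query, text):
    """Count exact word matches and record their positions in the query."""
    text_words = set(text.lower().split())
    hits = [(i, w) for i, w in enumerate(query.lower().split()) if w in text_words]
    return len(hits), [i for i, _ in hits]

def char_ngrams(s, n=3):
    s = s.lower().replace(" ", "")
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def fuzzy_match(query, text, n=3):
    """Share of the query's character n-grams found in the text (misspelling-tolerant)."""
    q, t = char_ngrams(query, n), char_ngrams(text, n)
    return len(q & t) / max(len(q), 1)

print(word_matches("angle bracket", "simpson strong-tie angle"))   # (1, [0])
print(fuzzy_match("angle braket", "galvanized angle bracket"))     # ~0.78 despite the typo
```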

Semantic Match

Figure 2: Visualization of word embedding vectors trained on product descriptions and titles - related words cluster in word embedding space (2D projection using multi-dimensional scaling on cosine distance matrix, k-means clustering).

To capture semantic similarity (e.g. shower vs bathroom) we performed matrix decomposition using latent semantic analysis (LSA) and non-negative matrix factorization (NMF). To further catch similarities not captured by LSA or NMF, which were trained on the Home Depot corpus, we used pre-trained word2vec and GloVe word embeddings trained on various external corpora. Among LSA, NMF, GloVe and word2vec, the GloVe embeddings gave the best performance. Figure 2 shows how they capture similar entities.
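A small sketch of the LSA variant using scikit-learn (TF-IDF followed by truncated SVD); the word2vec/GloVe features work analogously by averaging word vectors and taking cosine similarities. The toy strings below stand in for the real corpus:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

products = ["shower head chrome", "bathroom faucet", "cordless drill", "led bulb"]
queries = ["shower", "drill bits"]

tfidf = TfidfVectorizer().fit(products + queries)
lsa = TruncatedSVD(n_components=3, random_state=0).fit(tfidf.transform(products))

q_vec = lsa.transform(tfidf.transform(queries))
p_vec = lsa.transform(tfidf.transform(products))
print(cosine_similarity(q_vec, p_vec))    # query-to-product semantic similarity matrix
```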

Main Entity Extraction

The main motivation was to extract the main entities being searched for and being described in the queries and product titles respectively. Our primary approach was to include positional information about the matched terms, but out-of-bag error analysis revealed that this was not enough. We also experimented with POS tagging, but we noticed that many of the terms representing entity attributes and specifications were also tagged as nouns, and there was no obvious pattern to distinguish them from the main entity terms. Instead, we decided to extract the last N terms as potential main entities, after reversing the order of the terms whenever we saw prepositions such as "for", "with", "in", etc., which were usually followed by entity attributes/specifications.

Vocabulary Expansion

To catch 'pet' vs 'dog' type of relationships we performed vocabulary expansion for main entities extracted from the search terms and product titles. Vocabulary expansion included synonym, hyponym and hypernym extraction from WordNet.
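A sketch of this expansion with NLTK's WordNet interface (the WordNet corpus must be downloaded first); the 'pet'/'dog' kind of relation surfaces through hypernyms:

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

def expand(word):
    """Collect synonyms, hypernyms and hyponyms of a word from WordNet."""
    synonyms, hypernyms, hyponyms = set(), set(), set()
    for syn in wn.synsets(word):
        synonyms.update(l.name() for l in syn.lemmas())
        hypernyms.update(l.name() for h in syn.hypernyms() for l in h.lemmas())
        hyponyms.update(l.name() for h in syn.hyponyms() for l in h.lemmas())
    return synonyms, hypernyms, hyponyms

syns, hypers, hypos = expand("dog")
print(sorted(hypers))   # includes 'canine' and 'domestic_animal', i.e. the broader categories
```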

Aggregate Features

See “What was your most important insight into the data?” section for details.

Feature Interactions

We also performed basis expansions by including polynomial interaction terms between important features. These features also contributed further to the performance of our final model.

Supervised Learning Methods

Apart from the usual suspects like xgboost, random forest, extra trees and neural nets, we worked quite a lot with combinations of unsupervised feature transformations and generalized linear models, especially sparse random and Gaussian projections as well as Random Tree Embeddings (which did really well). On the supervised part, we tried a large number of Generalized Linear Models using the different feature transformations and different loss functions. Bayesian Ridge and Lasso with some of the transformed features did really well, the former also needing almost no hyperparameter tuning (and thus saving time). Another thing that worked really well was the regression-through-classification approach based on Extra Trees Classifiers. Selecting the optimal number of classes and tweaking the model to get reliable posterior probability estimates was important and took computational effort, but it contributed some of our best models (just behind the very best xgboost models).
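A sketch of the regression-through-classification idea on toy data: bin the continuous target, fit an Extra Trees classifier, and convert its posterior probabilities back to a continuous prediction as a probability-weighted average of the class centers (the number of bins here is an arbitrary choice):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(3000, 10)
y = np.clip(2 + X[:, 0] * 0.5 + rng.randn(3000) * 0.3, 1, 3)     # toy relevance target in [1, 3]

n_classes = 8
edges = np.linspace(1, 3, n_classes + 1)
centers = (edges[:-1] + edges[1:]) / 2
y_cls = np.clip(np.digitize(y, edges) - 1, 0, n_classes - 1)      # bin the target

X_tr, X_va, y_tr, y_va, ycls_tr, _ = train_test_split(X, y, y_cls, random_state=0)
clf = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X_tr, ycls_tr)
proba = clf.predict_proba(X_va)                    # posterior probability per class
y_pred = proba @ centers[clf.classes_]             # expected value = weighted class centers
print("RMSE:", np.sqrt(np.mean((y_va - y_pred) ** 2)))
```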

The idea was always to get models that are individually good on their own but have as little correlation as possible so that they can contribute meaningfully in the ensemble. The feature transformations, different loss functions, regression through classification, etc. all played well in this general goal.

Figure 3. Comparison of unsupervised random tree embedding and supervised classification in separating the relevant and non-relevant points (2D projections).

The two figures above show the effectiveness of the unsupervised Random Tree Embedding transform (upper picture). The separation visualized here is between two classes only (highly relevant points tend to sit high on the left, non-relevant points low and towards the right) and it is mingled. But we need to consider that this is a 2D projection done in a completely unsupervised way (the classes are visualized on top of the data; the labels were not used for anything other than visualization). For comparison, the bottom picture visualizes the posterior estimates for the two classes derived from a supervised Extra Trees classification algorithm (again, the highly relevant area is up and to the left, while the non-relevant area is at the bottom right).

How did you settle on a strong cross-validation strategy?

Alex: I think everyone joining the competition realized very early that a simple cross-validation does not properly reflect the generalization error on the test set. The number of search terms and products only present in the test set biased the cross-validation error and led to overfitted models. To avoid that, I first tried to generate cross-validation folds that account for both unseen search terms and unseen products simultaneously, but I was not able to come up with a sampling strategy that meets these requirements. I finally got the idea to “ensemble” multiple sampling schemes, and it turned out to work very well. We created two runs of 3-fold cross-validation with disjoint search terms among the folds, and one 3-fold cross-validation with disjoint product id sets. Taking the average error of the three runs turned out to be a very good predictor of the public and private leaderboard scores.
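A sketch of this "ensemble of sampling schemes": build folds with disjoint search terms under two different seeds, plus folds with disjoint product ids, and average the error across all runs (column names follow the competition data; the model is a stand-in):

```python
import numpy as np
import pandas as pd

def group_folds(df, group_col, n_folds=3, seed=0):
    """Yield (train_index, valid_index) with disjoint `group_col` values per fold."""
    rng = np.random.RandomState(seed)
    groups = df[group_col].unique()
    rng.shuffle(groups)
    for fold_groups in np.array_split(groups, n_folds):
        valid = df[group_col].isin(fold_groups)
        yield df.index[~valid], df.index[valid]

def cv_rmse(df, fit_predict):
    """Average RMSE over the three sampling schemes described above."""
    scores = []
    for group_col, seed in [("search_term", 0), ("search_term", 1), ("product_uid", 0)]:
        for tr, va in group_folds(df, group_col, seed=seed):
            pred = fit_predict(df.loc[tr], df.loc[va])
            scores.append(np.sqrt(np.mean((df.loc[va, "relevance"].values - pred) ** 2)))
    return np.mean(scores)

# Toy demo: a "model" that always predicts the training-set mean relevance.
toy = pd.DataFrame({
    "search_term": np.repeat(list("abcdef"), 10),
    "product_uid": np.tile(np.arange(10), 6),
    "relevance": np.random.uniform(1, 3, 60),
})
print(cv_rmse(toy, lambda train, valid: np.full(len(valid), train["relevance"].mean())))
```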

What was your most important insight into the data?

Figure 4. Information extraction about relevance of products to the query and quantification of query ambiguity by aggregating the products retrieved for each query.

In the beginning we measured search term to product similarity by different means, but the search terms were quite noisy (e.g. misspellings). Since most of the products retrieved are relevant, we clustered the products for each query, computed the cluster centroid, and used this centroid as a reference. Calculating the similarity of the products to the query centroid provided powerful information (see the figure above, left panel).

On top of this, some queries are ambiguous (e.g. ‘manual’ as opposed to ‘window lock’), and these ambiguous terms would be unclear for the human raters too and might lead to lower relevance scores. We decided to include this information as well by computing, for each query, the mean similarity of its products to the query centroid. The figure above (right panel) shows this relationship.
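A compact sketch of both centroid features on toy data, using TF-IDF vectors of product titles as the product representation (the team's actual vector representation may have differed):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({
    "search_term": ["window lock", "window lock", "manual", "manual", "manual"],
    "product_title": ["sash window lock", "keyed window lock bronze",
                      "lawn mower manual start", "manual can opener",
                      "garage door opener manual release"],
})

vecs = TfidfVectorizer().fit_transform(df["product_title"]).toarray()
sim_to_centroid = np.zeros(len(df))
for term, idx in df.groupby("search_term").groups.items():
    centroid = vecs[idx].mean(axis=0, keepdims=True)            # query centroid
    sim_to_centroid[idx] = cosine_similarity(vecs[idx], centroid).ravel()

df["sim_to_centroid"] = sim_to_centroid                          # per-product feature
df["query_ambiguity"] = df.groupby("search_term")["sim_to_centroid"].transform("mean")
print(df[["search_term", "sim_to_centroid", "query_ambiguity"]])
```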

Were you surprised by any of your findings?

Andreas: One surprising finding was that the residual errors of our predictions were exhibiting a strange pattern (different behavior in the last few tens of thousands of records), that hinted towards a bias somewhere in the process. After discussing it, we thought that a plausible explanation was a change of annotators or change in the annotations policy. We decided to model this by adding a binary variable (instead of including the id directly) and it proved a good bet.

Alex: I was surprised by the excellent performance of word embedding features compared to classical TF-IDF approach, even though the word embeddings were trained on a rather small corpus.

Which tools did you use?

Andreas: We used a Python tool chain, with all of the standard tools of the trade (scikit-learn, nltk, pandas, numpy, scipy, xgboost, keras, hyperopt, matplotlib). Sometimes R was also used for visualization (ggplot).

How did you spend your time on this competition?

Alex: We spent most of the time on preprocessing and feature engineering. To tune the models and the ensemble, we reused code from previous competitions to automate hyperparameter optimization, cross-validation and stacking, so we could run them overnight and while we were at work.

What was the run time for both training and prediction of your winning solution?

Alex: To be honest, recalculating the full feature extraction and model training pipeline takes several days, although our best features and models would finish after a few hours. We often tried to remove models and features to reduce the complexity of our solution, but it almost always increased the prediction error. So we kept adding new models and features incrementally over several weeks, leading to more than 20 independent feature sets and about 300 models in the first ensemble layer.

Words of wisdom

What have you taken away from this competition?

Alex: Never give up and keep working out new ideas, even if you are falling behind on the public leaderboard. Never throw away weak features or models, they could still contribute to your final ensemble.

Nurlan: Building a cross-validation scheme that's consistent with leaderboard score and power of ensembling.

Andreas: Persistency and application of best practices on all aspects (cross-validation, feature engineering, model ensembling, etc.) is what makes it work. You cannot afford to skip any part if you want to compete seriously in Kaggle these days.

Do you have any advice for those just getting started in data science?

Alex: Data science is a huge field - focus on a small area first and approach it through hands-on experimentation and curiosity. For machine learning, pick a simple toy dataset and an algorithm, automate the cross-validation, visualize decision boundaries and try to get a feeling for the hyperparameters. Have fun! Once you feel comfortable, study the underlying mechanisms and theories and expand your experiments to more techniques.

Nurlan: This was my first competition, and in the beginning I was competing alone. The problem of an unstable local CV score demotivated me a bit, as I couldn't tell how much a new approach helped until I made a submission. Once I joined the team, I learnt a great deal from Alexander and Andreas. So get into a team with experienced Kagglers.

Andreas: I really recommend participating in Kaggle contests even for experienced data scientists. There is a ton of things to learn and doing it while playing is fun! Even if in the real world you will not get to use an ensemble of hundreds of models (well most of the time at least), learning a neat trick on feature transformations, getting to play with different models in various datasets and interacting with the community is always worth it. Then you can pick a paper or ML book and understand better why that algorithm worked or did not work so well for a given dataset and perhaps how to tweak it in a situation you are facing.

Teamwork

How did your team form?

Alex: Andreas and I are former colleagues, and after we left the company we always planned to team up for a competition. I met Nurlan at a Predictive Analytics meet-up in Frankfurt and invited him to join the team.

How did your team work together?

Alex: We settled on a common framework for the machine learning part at the very beginning and synchronized changes in the machine learning code and in the hyperparameter configuration using a git repository. Nurlan and I had independent feature extraction pipelines, both producing serialized pandas dataframes. We shared those and the oob predictions using cloud storage services. Nurlan produced several new feature sets per week and kept Andreas and me very busy tuning and training models for them. We communicated mostly via group chat on Skype and had only two voice calls during the whole competition.

How did competing on a team help you succeed?

Andreas: We combined our different backgrounds and thus were able to cover a lot of alternatives fast. Additionally, in this contest having a lot of alternate ways of doing things like pre-processing, feature engineering, feature transformations, etc. was quite important in increasing the richness of the models that we could add in our stacking ensemble.

Just for fun

If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?

Alex: If I had access to a suitable dataset, I would run a competition on predictive maintenance to predict remaining useful lifetime of physical components. Also I would love to work on a competition where reinforcement learning can be applied.

The Team

Dr. Andreas Merentitis received B.Sc., M.Sc., and Ph.D. degrees from the Department of Informatics and Telecommunications, National Kapodistrian University of Athens (NKUA) in 2003, 2005, and 2010 respectively. Between 2011 and 2015 he was a Senior Data Scientist at AGT International. Since 2015 he has worked as a Senior Data Scientist at Zalando SE. He has more than 30 publications in machine learning, distributed systems, and remote sensing, including publications in flagship conferences and journals. He was awarded an IEEE Certificate of Appreciation as a core member of the team that won first place in the “Best Classification Challenge” of the 2013 IEEE GRSS Data Fusion Contest. He holds a Master ranking on Kaggle.

Alexander Bauer is a data science consultant with 10 years of experience in statistical analysis and machine learning. He holds a degree in electrical engineering and a PhD in computer science.

Nurlanbek Duishoev received his BSc and PhD degrees in biological sciences from Middle East Technical and from Heidelberg University respectively. His research focused on drug screenings and biological image data analysis. He later moved on to apply recommender system approaches for gene function prediction. The wealth of data being generated in the biomedical field, like in many other industries, motivated him to master state-of-the-art data science techniques via various MOOCs and participate in Kaggle contests.

Home Depot Product Search Relevance, Winners' Interview: 3rd Place, Team Turing Test | Igor, Kostia, & Chenglong


The Home Depot Product Search Relevance competition, which ran on Kaggle from January to April 2016, challenged Kagglers to use real customer search queries to predict the relevance of product results. Over 2,000 teams made up of 2,553 players grappled with misspelled search terms and relied on natural language processing techniques to creatively engineer new features. With their simple yet effective features, Team Turing Test found that a carefully crafted minimal model is powerful enough to achieve a high-ranking solution. In this interview, Team Turing Test walks us through the full approach that landed them in third place.

The basics

What was your background prior to entering this challenge?

Chenglong Chen: I received my Ph.D. degree in Communication Engineering from Sun Yat-sen University, Guangzhou, China, in 2015. As a Ph.D. student, I mainly focused on passive digital image forensics and applied various machine learning methods, e.g., SVM and deep learning, to detect whether a digital image has been edited/doctored. I had participated in a few Kaggle competitions before this one.

Igor Buinyi: I was a Ph.D. student at the Institute of Physics, Academy of Sciences of Ukraine. Later I received my MA degree in Economics from Kyiv School of Economics, then applied my skills in the financial sector and computer games industry. I have solid skills in statistics, data analysis and data processing, but I started seriously working on machine learning only recently after having graduated with the Udacity Data Analyst Nanodegree. Now I analyze customer behavior at Elyland.

Kostiantyn Omelianchuk: I have an MS degree in System Analysis and Management from Kyiv Polytechnic Institute and 3 years of data analysis and data processing experience in the financial sector and game development industry. Now I analyze customer behavior at Elyland.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

Chenglong: I have limited knowledge of NLP tasks. In addition to the CrowdFlower Search Results Relevance competition at Kaggle (where I ended up in 1st place), reading forum posts, previous Kaggle winning solutions, related papers, and lots of Google searches have given me many inspirations.

Igor: In a student project, I used NLP and machine learning to analyze the Enron email dataset.

How did you get started competing on Kaggle?

Kostia: Almost a year ago I read an article about a solution to a Kaggle competition. I was very impressed by that text, so I registered on Kaggle and invited Igor to do the same. We then participated in Springleaf Marketing Response; however, we only finished in the top 15%.

Chenglong: It dates back to 2012. At that time, I was taking Prof. Hsuan-Tien Lin's Machine Learning Foundations course on Coursera. He encouraged us to compete on Kaggle to apply what we had learnt to real-world problems. Since then, I have occasionally participated in competitions I find interesting.

What made you decide to enter this competition?

Igor: I had just graduated from the Udacity Nanodegree program and was eager to apply my new skills in a competition, no matter what. This competition had just started at the time, so I joined and invited Kostia to the team.

Chenglong: I have prior successful experience in the CrowdFlower Search Relevance competition on Kaggle which is quite similar to this one. I also had some spare time and wanted to strengthen my skills.

Let's get technical

What preprocessing and supervised learning methods did you use?

The documentation and code for our approach are available here. Below is a high level overview of the method.

Fig. 1 Overall flowchart of our method.

The text preprocessing step included text cleaning (removing special characters, unifying the punctuation, spell correction and thesaurus replacement, removing stopwords, stemming) and finding different meaningful parts in text (such as concatenating digits with measure units, extracting brands and materials, finding part-of-speech tags for each word, using patterns within the text to identify product names and other important words).
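A heavily condensed sketch of such a cleaning step (the real pipeline lives in the linked repository; the spell-correction dictionary entries here are purely illustrative, and the NLTK stopwords corpus must be downloaded):

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
STEM = PorterStemmer()
SPELL = {"toliet": "toilet", "vynil": "vinyl"}          # illustrative entries only

def clean(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9.\s]", " ", text)           # drop special characters
    text = re.sub(r"(\d+)\s*(inches|inch|in\.)", r"\1 inch", text)   # unify measure units
    words = [SPELL.get(w, w) for w in text.split()]     # spell/thesaurus replacement
    words = [STEM.stem(w) for w in words if w not in STOP]
    return " ".join(words)

print(clean("Vynil fence panel, 36 in. x 6 ft, for the garden"))
```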

Our features can be grouped as follows:

    • Basic features like word count, character count, percentage of digits, etc.
    • Intersection and distance features (various intersections between search term and other text information, Jaccard and Dice coefficients calculated using different algorithms).
    • Brand/material match features. Brands and materials were among the key determinants of the search results relevance.
    • First and last ngram features. They are designed to put more weight on the first and last ngram. The general idea would be to incorporate position weighting into these features.
    • LSA features. Most of our models are not efficient in dealing with sparse TFIDF vector space features. So we used the dimension reduced version via SVD.
    • TFIDF features in different combinations. They allow accounting for word frequency within the whole text corpus or particular query-product pair.
    • Word2vec features from pretrained models or models trained on the HomeDepot data.
    • WordNet similarity features. It allowed us to assess the closeness of the words as defined by NLTK WordNet. We calculated synset similarity for pairs of important words (ideally, we wanted important words to be product names and their main characteristics) or even whole strings.
    • Query expansion. The idea was to group similar queries together and then estimate how common a particular product description was. The relevance distribution was significantly skewed to 3, so in each small subset the majority of relevances would be closer to 3 than to 1. Thus, a higher intersection of a particular product description with the list of the most common words would indicate higher relevance.
    • Dummies: brands, materials, important words.

How did you settle on a strong cross-validation strategy?

Chenglong: After seeing quite a few LB shakeups in Kaggle competitions, the first thing I do after entering a competition is to perform some data analysis and set up an appropriate cross-validation strategy. This is quite important, as the CV results act as our guide for optimization throughout the remaining procedure (and we don't want to be misled). It also helps accelerate the change-validate-check cycle. After some analysis, we found that some search terms appear only in the training set, some only in the testing set, and some in both; the same applies to product uid. Taking this into consideration, we designed our own splitter for the data. Using this splitter, the gap between local CV and public LB was consistently around 0.001~0.002, which is within the 2-std range of the CV for both single models and ensembled models. In the following figure, we compare different splits of product uid and search term via Venn diagrams.

Fig. 2 Comparison of different splits on search term (top row) and product uid (bottom row). From left to right, shown are the actual train/test split, a naive 0.69 : 0.31 random shuffle split, and the proposed split.

What was your most important insight into the data?

Chenglong: Including this competition, I have participated in two relevance prediction competitions. From my point of view, the most important features are those measuring the distance between the search query and the corresponding results. Such distance can be measured via various distance measures (Jaccard, Dice, cosine similarity, KL, etc.), at different levels (char ngram, word ngram, sentence, document), and with different weighting strategies (position weighting, IDF weighting, BM25 weighting, entropy weighting).

Kostia: We found the data too noisy to be captured by a single unified text processing approach. Therefore, we needed to apply as many preprocessing/feature extraction algorithms as we could in order to get a better performance.

Igor: We also found it extremely important to account for the number of characters in a word while generating various features. For example, the two strings ‘first second’ and ‘first third’ have a Jaccard coefficient of 1/(2+2-1)=1/3 in terms of words, but 5/(11+10-5)=5/16 when each word is weighted by its number of characters. Incorporating this character-count information into the intersection features, distance features and TFIDF features was very beneficial.
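The two variants can be written in a few lines; the snippet below reproduces exactly the 1/3 and 5/16 values from the example:

```python
def jaccard_words(a, b):
    """Plain word-level Jaccard coefficient."""
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B)

def jaccard_chars(a, b):
    """Jaccard coefficient where each word is weighted by its character length."""
    A, B = set(a.split()), set(b.split())
    size = lambda s: sum(len(w) for w in s)
    return size(A & B) / size(A | B)

print(jaccard_words("first second", "first third"))   # 1/3
print(jaccard_chars("first second", "first third"))   # 5/16 = 0.3125
```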

Were you surprised by any of your findings?

Kostia: I discovered that routine work gives 90% of the result. Only a minor part comes from luck and knowledge of ‘magic’ data science, at least in this competition. One can also easily compensate for a lack of particular knowledge with some invested time and effort. Perseverance and the desire to learn are the key factors for good performance.

Chenglong: Some data in the testing set are poisoned rows that are not used in either the public LB or the private LB.

Igor: During the competition we had to add more and more features to move forward. In total we ended up with about 500 features in our single models. After the competition ended, we tried to produce a simple model as required by the Kaggle solution template. We were surprised that our 10-feature model would yield a private RMSE of 0.44949 without any ensembling, enough for 31st place on the leaderboard. It shows that a solution to such a problem can be simple and effective at the same time. (To be clear, we did not use the released information about the relevance of the test set to produce our simple model.)

Which tools did you use?

Chenglong: We mostly used Python with packages such as numpy, pandas, regex, matplotlib, gensim, hyperopt, keras, NLTK, sklearn and xgboost. I also used R with the Rtsne package for computing the t-SNE features. Igor and Kostia used Excel to perform some descriptive analysis and generate some charts.

How did you spend your time on this competition?

Igor: Kostia and I spent almost equal amounts of time on feature engineering and machine learning. For some tasks we specialized: I focused on text preprocessing and on using NLTK WordNet, while Kostia got the most out of word2vec. At some point we realized that we had a chance to win, so we had to properly document our work and adhere to the requirements for winning solutions. So, in the final four weeks of the competition we spent much time rewriting and cleaning our code and recalculating features according to a unified text processing algorithm (which did not improve the results; we even lost some performance due to the reduced variance of our features).

Chenglong: At the very beginning of the competition, I focused on figuring out an appropriate CV strategy. After that I started to reproduce the solution I used in the CrowdFlower competition. Meanwhile, I spent quite some time refactoring the whole framework so that data processors, feature generators and model learners could be added easily. With a scalable codebase, I spent about 70% of my time figuring out and generating varied and effective features.

Kostia: During the final week the whole team spent a few days discussing patterns in the dataset (Fig. 3) and trying to figure out how to deal with them. Since our model predictions for the test set (one coming from Chenglong with some of our features, another from me and Igor) were quite different, we expected that one of our approaches would be more robust to a leaderboard shakeup. In fact, we did not observe a major difference between the models' performance on the public and private sets.

charts demonstrating patterns in the dataset

Fig. 3 Some charts that demonstrate the patterns in the dataset.

What was the run time for both training and prediction of your winning solution?

Igor: The text preprocessing and feature generation part takes a few days. Though a single xgboost model can be trained in about 5 minutes, training and predicting all models for the winning solution takes about 1 full day of computation.

Words of wisdom

What have you taken away from this competition?

    • Spelling correction, synonym replacement and text cleaning are very useful for search query relevance prediction, as was also revealed in CrowdFlower.
    • In this competition, it was quite hard to improve the score via ensembling and stacking. To get the most out of ensembling and stacking, one should really focus on introducing diversity. To that end, try various text processing pipelines, various feature subsets, and various learners. Team merging also contributes a lot of diversity!
    • That said, know how to separate the effect of an improved algorithm from the effect of increased diversity. Careful experiments are necessary. If some clean and effective solution does not work as well as your old models, do not discard it until you have compared both approaches under the same conditions.
    • Keep a clean and scalable codebase. Keep track/logs of the changes you have made, especially how you create those massive, crazy ensemble submissions.
    • Set up an appropriate and reliable CV framework. This allows you to try various local optimizations on the features and models and get feedback without having to submit. This is important for accelerating the change-validate-check cycle.

Teamwork

How did your team form?

Igor: Kostia and I had worked together since the start of the competition. Before the merger deadline, competitors started to merge very actively. We also realized that we had to merge in order to remain at the top. If we remember correctly, at the moment of our final merger Chenglong was the only solo competitor in the top 10 and there were a few other teams of two in the top 10. So, our merger was natural.

How did your team work together?

Chenglong: For most of our discussion, we used Skype. For data and file sharing, we used Google Drive and Gmail.

How did competing on a team help you succeed?

Chenglong: About one or two weeks before the team merging deadline, I was the only one in the top 10 competing solo. If I hadn't merged with Igor and Kostia, I might not have been able to enter the top 10 on the private LB. Competing on a team helps to introduce variety for obtaining better results. Also, we learned a lot from each other.

Igor: All top teams in this competition merged during the later stages, and this fact signifies the importance of merging for achieving top performance. In our team we were able to quickly test a few ideas due to information sharing and cooperation.

Just for fun

If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?

Kostia: Developing an AI or recommendation tool for poker.

What is your dream job?

Chenglong: Chef.

Home Depot Product Search Relevance, Winners' Interview: 2nd Place | Thomas, Sean, Qingchen, & Nima


The Home Depot Product Search Relevance competition challenged Kagglers to predict the relevance of product search results. Over 2000 teams with 2553 players flexed their natural language processing skills in attempts to feature engineer a path to the top of the leaderboard. In this interview, the second place winners, Thomas (Justfor), Sean (sjv), Qingchen, and Nima, describe their approach and how diversity in features brought incremental improvements to their solution.

The basics

What was your background prior to entering this challenge?

Thomas is a pharmacist with a PhD in Informatics and Pharmaceutical Analytics and works in Quality in the pharmaceutical industry. On Kaggle he has taken part in earlier competitions and received the Script of the Week award.

Sean is an undergraduate student in computer science and mathematics at the Massachusetts Institute of Technology (MIT).

Kaggler Justfor Kaggler sjv

Qingchen is a data scientist at ORTEC Consulting and a PhD researcher at the Amsterdam Business School. He has experience competing on Kaggle but this was the first time with a competition related to natural language processing.

Nima is a PhD candidate at the Lassonde School of Engineering at York University focusing on research in data mining and machine learning. He also has experience competing on Kaggle but up to now has focused on other types of competitions.

Kaggler Qingchen Kaggler Nima

Between the four of us, we have quite a bit of experience with Kaggle competitions and machine learning, but minor experience in natural language processing.

What made you decide to enter this competition?

For all of us, the primary reason was that we wanted to learn more about natural language processing (NLP) and information retrieval (IR). This competition turned out to be great for that, especially in providing practical experience.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

All of us have strong theoretical experience with machine learning in general, and it naturally helps with the understanding and implementation of NLP and IR methods. However, none of us have had any real experience in this domain.

Let's get technical

What preprocessing and supervised learning methods did you use?

The key to this competition was mostly preprocessing and feature engineering as the primary data is text. Our processed text features can broadly be grouped into a few categories: categorical features, counting features, co-occurrence features, semantic features, and statistical features.

  • Categorical features: Put words in categories such as colors, units, brands, core. Count the number of those words in the query/title and count number of intersection between query and title for each category.
  • Counting features: Length of the query, number of common n-grams between query and title, Jaccard similarity, etc.
  • Co-occurrence features: Measures of how frequently words appear together. e.g., Latent Semantic Analysis (LSA).
  • Semantic features: Measure how similar the meaning of two words is.
  • Statistical features: Compare queries with unknown score to queries with known relevance score.

It seems that a lot of the top teams had similar types of features, but the implementation details are probably different. For our ensemble we used different variations of xgboost along with a ridge regression model.

Word cloud of Home Depot product search terms.

Word cloud of Home Depot product search terms.

For models and the ensemble we started with random forest, extra trees and GBM models. Furthermore, xgboost and ridge were in our focus. Shortly before the end of the competition we found out that first random forest and then extra trees no longer helped our ensembles, so we focused on xgboost, GBM and ridge.
Our best single model was an xgboost model that scored 0.43347 on the public LB. The final ensemble consists of 19 models based on xgboost, GBM and ridge. The xgboost models were built with different parameters, including binarizing the target, the reg:linear objective, and the count:poisson objective. We found that ridge regression helped in nearly every case, so we included it in the final ensemble.
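For illustration only, here is a minimal sketch of that kind of blend on synthetic data. The feature matrix, hyperparameters and the modern objective name reg:squarederror (rather than the reg:linear alias used at the time) are placeholders, not the team's actual settings.

    import numpy as np
    import xgboost as xgb
    from sklearn.linear_model import Ridge
    from sklearn.datasets import make_regression

    # Synthetic stand-in for the engineered feature matrix and relevance labels.
    X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=0)
    y = 1 + 2 * (y - y.min()) / (y.max() - y.min())        # squash into a 1..3 "relevance" range
    X_train, X_test, y_train = X[:800], X[800:], y[:800]

    models = [
        xgb.XGBRegressor(objective="reg:squarederror", n_estimators=300, max_depth=7, learning_rate=0.05),
        xgb.XGBRegressor(objective="count:poisson", n_estimators=300, max_depth=7, learning_rate=0.05),
        Ridge(alpha=1.0),
    ]

    # Simple average of the individual predictions; the real ensemble used 19 models.
    preds = [m.fit(X_train, y_train).predict(X_test) for m in models]
    blend = np.mean(preds, axis=0)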

Our data processing pipeline.

Our data processing pipeline.

Were you surprised by any of your findings?

A surprising finding was the large number of features which had predictive ability. In particular, when we teamed up, it was better to combine our features than to ensemble our results. This is quite unusual, as most of the time new features are more likely to cause overfitting, but not in this case. As a result, adding more members to the team was highly likely to improve the score, which is why the top 10 were all teams of at least three people.

Which tools did you use?

We mainly used Python 3 and Python 2. The fallback to Python 2 is interesting: some of the libraries we used were still not available for Python 3. In our processing chain we used the standard Python tools for machine learning (scikit-learn, nltk, pandas, numpy, scipy, xgboost, gensim). Nima used R for feature generation.

How did you spend your time on this competition?

After teaming up, Sean and Nima spent most of their time on feature engineering and Thomas and Qingchen spent most of their time on model tuning.

What was the run time for both training and prediction of your winning solution?

In general, training/prediction time is very fast (minutes), but we used some xgboost parameters that took much longer to train (hours) for small performance gains. Text processing and feature engineering took a very long time (easily over 8 hours for a single feature set).

Words of wisdom

What have you taken away from this competition?

First of all, quite a lot of Kaggle ranking points, and Thomas got his Master badge! Overall this was a very difficult competition and we learned a lot about natural language processing and information retrieval in practice. It now makes sense why Google is able to use such a large number of features in its search algorithm, as many seemingly insignificant features in this competition were still able to provide a tangible performance boost.

Teamwork

How did your team form?

Initially Thomas and Sean teamed up, as Sean had strong features and Thomas had experience with models and Kaggle. The models complemented each other well and ensembling brought the team into the top 10. A further boost came when Qingchen joined with his features and models. At this point we (and other teams) realized that it was a necessity to form larger teams in order to be competitive, as combining features really helps improve performance. We decided to ask Nima to join us as he had an excellent track record and was also doing quite well on his own.

Working together was quite interesting as we are from Germany, the US, the Netherlands and Canada. The different time zones made direct communication difficult, so we opted for email communication. For getting results and continuing to work on ideas, the different time zones were actually helpful.

Team bios

Dr. Thomas Heiling is a pharmacist, with his PhD in Informatics and Pharmaceutical Analytics and works in Quality in the pharmaceutical industry.

Sean J. Vasquez is a second year undergraduate student at the Massachusetts Institute of Technology (MIT), studying computer science and mathematics.

Qingchen Wang is a Data Scientist at ORTEC Consulting and a PhD researcher in Data Science and Marketing Analytics at the Amsterdam Business School.

Nima Shahbazi is a second-year PhD student in the Data Mining and Database Group at York University. He previously worked in big data analytics, specifically on the Forex market. His current research interests include mining data streams, big data analytics and deep learning.

Avito Duplicate Ads Detection, Winners' Interview: 3rd Place, Team ADAD | Mario, Gerard, Kele, Praveen, & Gilberto


The Avito Duplicate Ads Detection competition ran on Kaggle from May to July 2016 and attracted 548 teams with 626 players. In this challenge, Kagglers sifted through classified ads to identify which pairs of ads were duplicates intended to vex hopeful buyers. This competition, which saw over 8,000 submissions, invited unique strategies given its mix of Russian language textual data paired with 10 million images. In this interview, team ADAD describes their winning approach which relied on feature engineering including an assortment of similarity metrics applied to both images and text.

The basics

What was your background prior to entering this challenge?

Mario Filho: My background in machine learning is completely "self-taught". I found a wealth of educational materials available online through MOOCs, academic papers and lectures in general. Since February 2014 I have worked as a machine learning consultant on projects for small startups and Fortune 500 companies.

Gerard Toonstra: I worked as a scientific developer at Thales for 3 years, which introduced me to more scientific development methods and algorithms. Most of my specific ML knowledge was acquired through courses on Coursera and by getting my hands dirty in Kaggle competitions and forum interactions.

Kele Xu: I am a PhD student working on silent speech interfaces.

Praveen Adepu: Academically, I have a Bachelor of Technology, and I work as a full-stack BI Technical Architect/Consultant.

Gilberto Titericz: I graduated in electronic engineering and hold an M.S. in wireless communications. In 2011 I started to learn data science by myself, and after joining Kaggle I started to learn even more.

How did you get started competing on Kaggle?

Mario Filho: I heard about Kaggle when I was taking my first courses about data science, and after I learned more about it I decided to try some competitions.

Gerard Toonstra: I was active in the Netflix grand prize quite a while ago and at the end, it pointed to the Kaggle site as another potential place to get your hands dirty. I ignored that until July last year, when I decided to start on the Avito click challenge. It's pretty cool that exactly one year later, after some avid Kaggling, I'm part of the 3rd place submission.

Kele Xu: I participated in KDD Cup 2015 and finished 40th there. That was my first competition. After KDD Cup 2015, I became a Kaggler, which helped me learn a lot over the last year.

Praveen Adepu: I am new to machine learning, R and Python. I like learning by doing, and I realised Kaggle is the best fit for learning by participating in competitions. Initially I experimented with a few competitions to find learning patterns, then started working seriously about six months ago. I have learnt a lot in the past six months and look forward to learning more from Kaggle.

Gilberto Titericz: After the Google AI challenge 2011 I was searching on the internet for other online competition platform and found Kaggle.

What made you decide to enter this competition?

Mario Filho: At the time it was the only competition that had a reasonable dataset size and was not too focused on images. So I thought it would be possible to get a good result with feature engineering and the usual tools.

Gerard Toonstra: I feel that my software engineering background can give me an edge in certain competitions. The huge amount of data requires a bit of planning and modularizing the code. I started doing that in the Dato competition, continued doing it a bit better for Home Depot and in Avito I started some more serious pipelining and feature engineering. It's not that the pipelines are sophisticated, the only purpose is to reduce feature building time. Instead of waiting for one script to finish in 8 hours, I just build features in parts and glue/combine them together.

Kele Xu: When I decide to enter a new competition, I like to select a topic I have no prior experience with. That way, I take more away from the competition. In fact, before this competition, I had done only a few experiments on NLP topics. That's the main reason I entered this competition.

Praveen Adepu:

  1. I like feature engineering, and this competition required a lot of hand-crafted feature engineering.
  2. The very high LB benchmark score really attracted me as a test of the skills I had learnt in previous competitions.
    I was still left with a lot of feature engineering ideas even after passing the benchmark in a couple of weeks, so I planned to spend a bit more time on this competition.

Gilberto Titericz: Lately I have interest in competitions involving image processing and deep learning.

Let's Get Technical

What preprocessing and supervised learning methods did you use?

Mario Filho: I stemmed and cleaned the text fields, then used tf-idf to compute similarities between them. I used the hash similarity script available in the forums to compute the similarity between images. After creating and testing lots of features, I used XGBoost to train a model.

Gerard Toonstra: I had one feature array for textual features, which is essentially what others did in the forum. Then I built another feature set with additional text features, a MinHash similarity feature, and tf-idf + SVD over the description, and then I worked on image hashes: ahash, phash and dhash with min, max and average features, plus the count of zeros divided by the maximum number of images. I kept focusing on making these normalized distance metrics, so dividing by 'length' where appropriate or by the number of images. As things advanced, I also did image histograms on RGB, histograms on HSV and the Earth Mover's Distance between images.

I ended up only using four model types: logistic regression over tf-idf features over cleaned text, XGB, Random Forests and FTRL.

Kele Xu: Here I used only XGBoost. XGBoost is really competitive here: my best single XGBoost model reached the top 14 on both the public and private leaderboards.

Praveen Adepu: No special pre-processing methods; I just followed the best practices of feature engineering and kept an eye on processing times when creating new features.
I used XGBoost, h2o random forest, h2o deep learning, extra trees and regularized greedy forest.

Gilberto Titericz: Most preprocessing was done on the text features, like stop-word removal and stemming. I also built a very interesting feature using deep learning: I used the pre-trained MXNet Inception-21k model to predict one class for each of the 10M images in the dataset. Computing some statistics between those classes helped improve our models.
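As a conceptual illustration of such between-ad class statistics, the sketch below compares the predicted classes of two ads' image sets. `predict_imagenet_class` is a hypothetical stand-in for the pre-trained Inception-21k classifier, and the particular statistics are examples rather than the exact features the team built.

    from collections import Counter

    def class_match_features(images_ad1, images_ad2, predict_imagenet_class):
        """Statistics comparing the predicted classes of the images of two ads."""
        classes1 = Counter(predict_imagenet_class(img) for img in images_ad1)
        classes2 = Counter(predict_imagenet_class(img) for img in images_ad2)
        common = set(classes1) & set(classes2)
        n1, n2 = sum(classes1.values()), sum(classes2.values())
        top1 = classes1.most_common(1)[0][0] if classes1 else None
        top2 = classes2.most_common(1)[0][0] if classes2 else None
        return {
            "n_common_classes": len(common),
            "share_common_1": sum(classes1[c] for c in common) / max(n1, 1),
            "share_common_2": sum(classes2[c] for c in common) / max(n2, 1),
            "same_top_class": int(top1 is not None and top1 == top2),
        }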

The supervised learning methods used are based on gradient boosting (XGBoost), RandomForest, ExtraTrees, FTRL, linear regression, Keras and MXNet. We also used the libffm and RGF algorithms, but we dropped them in the end.

What have you taken away from this competition?

Mario Filho: I learned new techniques for calculating image similarity based on hashes, and 2 or 3 Russian words.

Gerard Toonstra: The benefit of working in teams and learning from that experience. I really wanted to learn about stacking. There is a good description on mlwave.com, but when you get around to actually doing that in practice, there are still questions that pop up. The other observation is that as a competition draws to a close, it's all hands on deck and you have to put the last effort in, especially the last week. I did not have enough hardware resources and used in the most extreme case three 16-cpu machines spread across three geographical zones on Google Cloud. Despite the hardware challenges and exploration we had to do, I noticed some solo competitors who were able to gain ground quickly and rocket up. I really respect that. It tells me that those guys have acquired so much experience that they consistently make right decisions on what to spend time on and I still have a long way to go.

Kele Xu: I learned a lot from my teammates, especially about hyper-parameter optimization for the XGBoost model and ensemble techniques. I also got to know how to do some NLP tasks.

Praveen Adepu: I think joining the team at the right time and with the right team members was the best decision I made in this competition.

I started with two goals while joining the team:

  1. Learn from the team while contributing to the best of my knowledge
  2. Learn ensembling/stacking

I would like to take this opportunity to say thank you my entire team Gilberto, Mario, Gerard and Kele.

Gilberto - thank you for teaching many many concepts including stacking
Mario - thank you for guiding me while experimenting many models
Gerard - never forget our initial discussion on feature engineering
Kele - thank you for the invite to join the team and making this happen

I learnt more from this team in one month than alone in the past six months, and I look forward to working with them in the near future.

Gilberto Titericz: I learned a lot about text and image distance metrics.

Do you have any advice for those just getting started in data science?

Mario Filho: If you plan to apply machine learning, try to understand the fundamentals of the models and concepts like validation, bias and variance. Really understand these concepts and try to apply them to data.

Gerard Toonstra: The first thing is to understand cross validation score and its specific relation to the leaderboard in that competition and how to establish a sound folding strategy in general. After that, I recommend establishing a very simple model based on features that do not overfit and evaluate the behavior. Then add a couple of features one by one that are unlikely to overfit and keep evaluating. Record the difference between CV and LB when you make submissions. As you establish confidence in your local CV, start working out more features and do a couple of in-between checks to ensure you're not overfitting locally; especially depending on how some aggregation features are built, they can have overfitting effects.

When you get started, don't allow yourself to get bogged down by endless hours of parametrizing and hyperparameterizing the models. You may get the extra +0.00050 to jump up 3 positions, but it's not what most on the LB are doing. Figure out how to create new informative features, which features should be avoided and when you grow tired of that, only then spend some time optimizing what you have.

Kele Xu: In fact, I think Kaggle is quite a good platform for both the data scientists and those who are just getting started, as Kaggle can provide a platform for the new guys to test how each algorithm performs.

As for technique suggestions, I would say a solid local CV is the key to winning a competition. Usually I spend 2-4 weeks testing my local CV.

Feature engineering is more important than hyper-parameter optimization.

Although XGBoost is enough to get a relatively good rank in some competitions, we should also test more models to add diversity, not just select different seeds or set different max depths for tree-based methods.

The last thing is: just test your ideas, and you can get some feedback from the LB. When you get to a higher rank in one competition, you will have the motivation to do even better.

Praveen Adepu: I would call these my experiences rather than advice:

  1. Start with clear objectives: learn the basics slowly and progressively, pick up advanced concepts quickly, and end with mastering them.
  2. Never be afraid to start and fail.
  3. Get a clear understanding of local CV, underfitting and overfitting.
  4. Start learning with feature engineering and end with stacking.

Gilberto Titericz: Read a lot of related material. Search for problems to solve. Read about solutions to those problems. And make your fingers work: program and learn from your mistakes.

Bios

Mario Filho is a machine learning consultant focused on helping companies around the world use machine learning to maximize the value they get from data and achieve their business goals. He also mentors individuals who want to learn how to apply machine learning algorithms to real-world data sets.

Gerard Toonstra graduated as a nautical officer and engineer, but has mostly worked as a software engineer and started his own company in Brazil working with drones for surveying. He now works as a scrum master for the BI department at Coolblue in the Netherlands.

Kele Xu is a PhD student writing his PhD thesis at the Langevin Institute (University of Pierre and Marie Curie). His main interests include silent speech recognition, machine learning and computer vision.

Praveen Adepu is currently working as a BI Technical Architect/Consultant at Fred IT, a Melbourne-based IT product company; his main interests are machine learning and data architecture.

Gilberto Titericz is an electronics engineer with a M.S. in telecommunications. For the past 16 years he's been working as an engineer for big multinationals like Siemens and Nokia and later as an automation engineer for Petrobras Brazil. His main interests are in machine learning and electronics areas.


Kernel Corner

Getting hashes from images was an important strategy in detecting similarities across the 10 million images in this competition as Gerard Toonstra explains:

phash calculates a hash in a special way such that it reduces the dimensionality of the source image into a 64-bit number. It captures the structure of the image. When you then subtract two hashes, you get a number that resembles the 'structural distance' between the two images. The higher the number, the greater the distance between the two.
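For readers who want to experiment, a minimal sketch of this idea using the third-party imagehash package together with Pillow (an assumption of this example; the shared kernel below implements its own hashing) looks like this:

    from PIL import Image
    import imagehash

    def phash_distance(path_a, path_b):
        """Hamming distance between the 64-bit perceptual hashes of two images."""
        hash_a = imagehash.phash(Image.open(path_a))
        hash_b = imagehash.phash(Image.open(path_b))
        return hash_a - hash_b   # imagehash overloads '-' to count differing bits

    # Identical images give 0; structurally similar images give small values.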

Kaggler Run2 shared code on Kaggle Kernels which allowed Kagglers to incorporate distance metrics from images without performing heavy duty image processing or deep learning.

Get hash from image script on Kaggle Kernels

Part of Kaggler Run2's script Get Hash From Images.


Facebook V: Predicting Check Ins, Winner's Interview: 2nd Place, Markus Kliegl


Facebook ran its fifth recruitment competition on Kaggle, Predicting Check Ins, from May to July 2016. This uniquely designed competition invited Kagglers to enter an artificial world made up of over 100,000 places located in a 10km by 10km square. For the coordinates of each fabricated mobile check-in, competitors were required to predict a ranked list of most probable locations. In this interview, the second place winner Markus Kliegl discusses his approach to the problem and how he relied on semi-supervised methods to learn check-in locations' variable popularity over time.

The basics

What was your background prior to entering this challenge?

I recently completed a PhD in mathematical fluid dynamics. Through various courses, internships, and contract work, I had some background in scientific computing, inverse problems, and machine learning.

Let's get technical

What preprocessing and supervised learning methods did you use?

The overall approach was to use Bayes' theorem: Given a particular data point (x, y, accuracy, time), I would try to compute for a suitably narrowed set of candidate places the probability

    \[P(place | x, y, accuracy, time) \propto P(x, y, accuracy, time | place) P(place) \,,\]

and rank the places accordingly. A la Naive Bayes, I further approximated

P(x, y, accuracy, time | place) as

    \[P(x, y, accuracy, time | place) \approx P(x, y | place) \cdot P(accuracy | place) \cdot P(time\, of\, day | place) \cdot P(day\, of\, week | place) \,.\]

I decided on this decomposition after a mixture of exploratory analysis and simply trying out different assumptions on the independence of variables on a validation set.

One challenge given the data size was to efficiently learn the various conditional distributions on the right-hand side. Inspired by the effectiveness of ZFTurbo's "Mad Scripts Battle" kernel early in the competition, I decided to start by just learning these distributions using histograms.

To make the histograms more accurate, I made them periodic for time of day and day of week and added smoothing using various filters (triangular, Gaussian, exponential). I also switched to C++ to further speed things up. (Early in the competition this got me to the top of the leaderboard with a total runtime of around 40 minutes single-threaded, while others were already at 15-50 hours. Unfortunately, I could not keep things this fast for very long.)
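A minimal sketch of the histogram idea for one of the conditionals, P(time of day | place), is given below; the bin count, smoothing width and pseudocount are illustrative values rather than those of the winning solution.

    import numpy as np

    def periodic_time_histogram(checkin_minutes, n_bins=96, sigma_bins=2.0, pseudocount=1.0):
        """Smoothed, periodic estimate of P(time-of-day bin | place) from one place's check-ins."""
        counts, _ = np.histogram(np.asarray(checkin_minutes) % 1440, bins=n_bins, range=(0, 1440))
        counts = counts.astype(float) + pseudocount            # no bin gets probability zero
        # Gaussian smoothing that wraps around midnight (circular convolution via FFT).
        offsets = np.arange(n_bins)
        kernel = np.exp(-0.5 * (np.minimum(offsets, n_bins - offsets) / sigma_bins) ** 2)
        kernel /= kernel.sum()
        smoothed = np.real(np.fft.ifft(np.fft.fft(counts) * np.fft.fft(kernel)))
        return smoothed / smoothed.sum()

    pdf = periodic_time_histogram([500, 505, 510, 1430, 5])     # toy check-in times for one place
    bin_index = lambda minute: int((minute % 1440) * 96 // 1440)
    print(pdf[bin_index(507)], pdf[bin_index(900)])             # likely vs. unlikely time of day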

For later submissions, I averaged the P(x, y | place) histograms with Gaussian Mixture Models.

What was your most important insight into the data?

The relative popularity of places, P(place), varied substantially over time (really it should be written as P(place, time)), and it seemed hard to me to forecast it from the training data (though others, like Jack (Japan) in third place, had some success doing this). Since the quality of the predictions even with a rough guess for P(place) was already fairly high, however, I realized a semi-supervised approach might stand a good chance of being able to learn P(place, time). My final solution performed 20 semi-supervised iterations on the test data.

Number of check-ins over time varied irregularly for many places.

The number of checkins over time varied quite irregularly for many places. A semi-supervised approach helped me overcome this irregularity.

Getting this to actually work well took some effort. There is more discussion in this thread.

Were you surprised by any of your findings?

Accuracy was quite mysterious at first. I initially focused on analyzing the relationship between accuracy and the uncertainty in the x coordinate and tried to incorporate that into my model. However, this helped only a tiny bit. I eventually came to the conclusion that accuracy is most gainfully employed directly by adding a factor P(accuracy | place): different places attract different mixes of accuracies. As suggested in the forums, this makes sense if one thinks of accuracy as a proxy for device type.

Another surprise was this: On the last day, I tried ensembling different initial guesses for P(place), but this improved the score only by 0.00001 over the best initial guess, which in turn was only 0.00015 better than the worst initial guess. Though I was disappointed to not be able to improve my score in this way (rushed experiments on a small validation set had looked a little more promising), this insensitivity to the initial guess is actually a good property of the solution. It speaks to the stability of convergence of the algorithm.

Which tools did you use?

Apparently multimodal distributions of check-ins.

The (x, y) distributions of checkins looked multimodal. KDE or Gaussian Mixture Models were thus natural to try for learning P(x, y | place).

I used Python with the usual stack (pandas, matplotlib, seaborn, numpy, scipy, scikit-learn) for data exploration and for learning Gaussian Mixture Models for the P(x, y | place) distributions. The main model is written in C++. Finally, I used some bash scripts and the GNU parallel utility to automate parallel runs on slices of the data.

How did you spend your time on this competition?

I spent a little time early on exploring the data, in particular doing case studies of individual places. After that, I spent almost all my time on implementing, optimizing, and tuning my custom algorithm.

What was the run time for both training and prediction of your winning solution?

Aside from the one-time learning of Gaussian Mixture Models (which probably took around 40 hours), the run time was around 60 CPU hours. Since the problem parallelizes well, the non-GMM run time was about 15 hours on my laptop. For the last few days of the competition, I borrowed compute time on an 8-core workstation, where the run time ended up at around 4-5 hours.

In this Github repository, I also posted a simplified single-pass version that would have gotten me to 6th place and that runs in around 90 minutes single-threaded on my laptop (excluding the one-time GMM training time). Compared to my full solution, this semi-supervised online learning version also has the nicer property of never using any information from the future.

Bio

Markus Kliegl profile photo
Markus Kliegl recently completed a PhD in Applied and Computational Mathematics at Princeton University. His current interests lie in machine learning research and applications.

Facebook V: Predicting Check Ins, Winner's Interview: 1st Place, Tom Van de Wiele


From May to July 2016, over one thousand Kagglers competed in Facebook's fifth recruitment competition: Predicting Check-Ins. In this challenge, Kagglers were required to predict the most probable check-in locations occurring in artificial time and space. As the first place winner, Tom Van de Wiele, notes in this winner's interview, the uniquely designed test dataset contained about one trillion place-observation combinations, posing a huge difficulty to competitors. Tom describes how he quickly rocketed from his first getting started competition on Kaggle to first place in Facebook V through his remarkable insight into data consisting only of x,y coordinates, time, and accuracy using k-nearest neighbors and XGBoost.

The basics

What was your background prior to entering this challenge?

I have completed two Master's programs at two different Belgian universities (Leuven and Ghent), one in Computer Science (2010) and one in Statistics (2016). I graduated from the Statistics program during the Kaggle competition and had been combining it with a full-time job at a manufacturing plant at Eastman in Ghent over the past couple of years. Initially I started as an automation engineer, and in the next phase I was mostly working on process improvements using the Six Sigma methodology. At the beginning of 2015 I got to the really good stuff when I started working with the data science group at Eastman, where I am currently employed as an analytics consultant. We solve various complex analysis problems with an amazing team and mostly rely on R, which we often combine with Shiny to develop interactive web applications.

Tom Van de Wiele

Tom Van de Wiele on Kaggle.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

I have always had a passion for modeling complex problems and think that this mindset helped me to do well more than anything else. The problem setting is very tangible and all four predictors can be interpreted by anyone, so it was a very accessible contest where mobile data domain knowledge doesn't really help. The problem can be translated to a classification setting, with the only major complication being the large number of classes (>100K). I did, however, read a lot about other winning solutions prior to the contest. The "Learning from the best" post on this blog was especially useful.

How did you get started competing on Kaggle?

Through a Kaggle ‘Getting Started’ competition 🙂 I was a passive user for a long time before entering my first competition. Like many others I wanted to compete one day but never really took the step to my first submission. Things changed when a colleague with a chemical engineering background wanted to get into machine learning and participated in the Kobe Bryant shot selection competition. He asked some great questions and I tried to point him into the right direction but his questions got me excited enough to download the data and implement my suggestions. Two evenings later I got close to the top 10 on the leaderboard at the time with about 500 participants and I started to dream about future competitions. That second evening was the launch date of the Facebook V competition and I wouldn’t have to dream for long!

What made you decide to enter this competition?

The promise of a possible interview at Facebook was a strong motivation to participate although I considered it to be highly unlikely given that it was my first featured Kaggle competition and I already had a fully booked agenda. My second main motivation was the promise of learning new techniques and insights from other contestants.

In hindsight I am very happy to be interviewing at one of the best companies in the world for machine learning professionals but I am even more grateful for everything I learned from my own struggles and the other participants. A competition setting makes you think outside of the box and continuously challenge your approach. The tremendous code sharing on the forums was a great catalyst in this process.

Let's get technical

Extended details of the technical approach can be found on my blog. The R code is available on my GitHub account along with high level instructions to construct the final submission.

What was your general strategy?

The main difficulty of this problem is the extended number of classes (places). With 8.6 million test records there are about a trillion (10^12) place-observation combinations. Luckily, most of the classes have a very low conditional probability given the data (x, y, time and accuracy). The major strategy on the forum to reduce the complexity consisted of calculating a separate classifier for many x-y rectangular grid cells. It makes sense to make use of the spatial information, since it shows the most obvious and strongest pattern for the different places. This approach makes the complexity manageable but is likely to lose a significant amount of information since the data is so variable. I decided to model the problem with a single stacked two-level binary classification model in order to avoid ending up with many high-variance models. The lack of any major spatial patterns in the exploratory analysis supports this approach.

Generating a single classifier for all place-observation combinations would be impractical even with a powerful cluster. My approach consists of a stepwise strategy in which the place probability (the target class) conditional on the data is only modeled for a set of place candidates. A simplification of the overall strategy is shown below:

strategy

The overall strategy used

The given raw train data is split into two chronological parts, with a ratio similar to that between the train and test data. The summary period contains all given train observations of the first 408 days (minutes 0-587158). The second part of the given train data contains the next 138 days and will be referred to as the train/validation data from now on. The test data spans 153 days, as mentioned before.

The summary period is used to generate train and validation features and the given train data is used to generate the same features for the test data.

The three raw data groups (train, validation and test) are first sampled down into batches that are as large as possible but can still be modeled with the available memory. I ended up using batches of approximately 30,000 observations on a 48GB workstation. The sampling process is fully random and results in train/validation batches that span the entire 138-day train range.

Next, a set of models using 430 numeric features is built to reduce the number of candidates to 20 using 15 XGBoost models in the second candidate selection step. The conditional probability P(place_match|features) is modeled for all ~30,000*100 place-observation combinations and the mean predicted probability of the 15 models is used to select the top 20 candidates for each observation. These models use features that combine place and observation measures of the summary period.

The same features are used to generate the first level learners. Each of the 100 first level learners is again an XGBoost model built using ~30,000*20 feature-place_match pairs. The predicted probabilities P(place_match|features) are used as features of the second level learners along with 21 manually selected features. The candidates are ordered using the mean predicted probabilities of the 30 second level XGBoost learners.

All models are built using different train batches. Local validation is used to tune the model hyperparameters.

What was your most important insight into the data?

I think I had good insight into several of the accuracy-related patterns. The accuracy distribution seems to be a mixture with three peaks which changes over time. It is likely related to three different mobile connection types (GPS, Wi-Fi or cellular). The places show different accuracy patterns, so features were added to indicate the relative accuracy group densities. The middle accuracy group was set to the 45-84 range. I added relative place densities for 3 and 32 approximately equally sized accuracy bins.
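A hedged pandas sketch of such relative-density features is shown below on toy data; only the 45-84 middle group comes from the text, the other bin edges are made up for the example.

    import pandas as pd

    train = pd.DataFrame({                         # toy stand-in for the training check-ins
        "place_id": [1, 1, 1, 2, 2, 2, 2],
        "accuracy": [10, 50, 70, 90, 120, 44, 60],
    })
    bins = [0, 45, 85, 10_000]                     # low / middle (45-84) / high accuracy groups
    train["acc_group"] = pd.cut(train["accuracy"], bins=bins, right=False,
                                labels=["low", "mid", "high"])

    # Share of each place's check-ins that falls into each accuracy group.
    densities = pd.crosstab(train["place_id"], train["acc_group"], normalize="index")
    print(densities)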

accuracy groups

Mean variation from the median in x versus 6 time and 32 accuracy groups.

It was also discovered that the location is related to the three accuracy groups for many places. This pattern was captured by adding extra features for the different accuracy groups.

Location and accuracy groups

The x-coordinates seem to be related to the accuracy group for places like 8170103882.

Studying the places with the highest daily counts also pointed me towards obvious yearly patterns which were translated to valuable features. The green line in the image below goes back 52 weeks since the highest daily count.

places with the highest daily counts

Clear yearly pattern for place 5872322184. The green line goes back 52 weeks since the highest daily count.

Were you surprised by any of your findings?

The strength of K nearest neighbors was remarkable in this problem. Nearest neighbor features make up a large share of my solution and the leading public script relied on the K nearest neighbor classifier. I was also surprised that I couldn’t find clear spatial patterns in the data (e.g. a party district).

Which tools did you use?

All code was implemented in R, and I created an Rcpp package to address the major bottleneck using C++. The most important package I used was by far the data.table package. I was not familiar with the syntax heading into the competition, but going through the trouble of learning it enabled me to handle the dimensions of the problem. Other critical tools are the xgboost package and the doParallel package. The exploratory data analysis led to a Shiny application which was shared with the other participants.

How did you spend your time on this competition?

I was forced to spend the first 10 days of the competition thinking about possible high level approaches due to other priorities and ended up with an approach that strongly resembled my final framework. The next 10 days were used to generate about 50 features and build the framework except for the second level learners. This intermediate result got me to the first spot on the public leaderboard and encouraged me to expand the feature set. I spent most of the remaining time on detailed feature engineering and started building the second tier of the binary classifier three weeks before the end of the contest. The last two weeks were mostly dedicated to hyperparameter optimization.

What was the run time for both training and prediction of your winning solution?

Running all steps to train the model and generate the final submission would take about a month on my 48GB workstation. That seems like a ridiculously long time but it is explained by the extended computation time of the nearest neighbor features. While calculating the NN features I was continuously working on other parts of the workflow so speeding the NN logic up would not have resulted in a better final score.

Words of wisdom

What have you taken away from this competition?

I learned a lot from the technical issues I ran into, but I have learned most from the discussions on the forum. It is great to learn from brilliant people like Markus. The way he used semi-supervised learning to learn from the future was an eye-opener. Many others made significant contributions, but it was especially useful to learn from Larry Freeman and Ben Hamner that we are better when we work together. An ensemble of top solutions can do much better than my winning submission!

private leaderboard scores

Private leaderboard score (MAP@3) - two teams stand out from the pack.

Do you have any advice for those just getting started in data science?

I would suggest starting with a study of various data science topics. Andrew Ng's course is an excellent place to start. Getting your hands dirty with appropriate feedback is the next step if you want to get better. Kaggle is of course an excellent platform to do so. I am very impressed with the quality and general atmosphere of the forum and would encourage everyone to start competing!

Bio

tom_van_de_wiele
Tom Van de Wiele recently completed his master of statistical data analysis at the University of Ghent. Tom has a background in computer science engineering and works in the data science group of Eastman as an Analytics Consultant where he works on various complex data challenges. His current interests lie in applied machine learning and statistics.

Facebook V: Predicting Check Ins, Winner's Interview: 3rd Place, Ryuji Sakata


The Facebook recruitment challenge, Predicting Check Ins, ran from May to July 2016 attracting over 1,000 competitors who made more than 15,000 submissions. Kagglers competed to predict a ranked list of most likely check-in places given a set of coordinates. Using just four variables, the real challenge was making sense of the enormous number of possible categories in this artificial 10km by 10km world. The third place winner, Ryuji Sakata, AKA Jack (Japan), describes in this interview how he tackled the problem using just a laptop with 8GB of RAM and two hours of run time.

The basics

What was your background prior to entering this challenge?

I'm working as a data scientist at an electronics company in Japan. My major at university (Aeronautics and Astronautics) is actually far from my current job, and I didn't have any knowledge about data science at all. I got to know this field after starting work, and I have kept learning ever since.

How did you get started competing on Kaggle?

I joined Kaggle about two and a half years ago in order to learn data mining in practice. The first competition I entered was the Allstate Purchase Prediction Challenge, and that's when I realized how fun data mining is. Since then, I have spent much of my spare time on Kaggle. Competing with other Kagglers is always very exciting and it has been a good learning experience!

What made you decide to enter this competition?

This competition seemed to be something special compared with other challenges. The objective variable we had to predict has so many categories, and there are only 4 variables we can use to predict it! I was attracted to the idea that this competition was testing Kagglers' ingenious ideas.

Let's get technical

What preprocessing and supervised learning methods did you use?

My basic idea is very similar to that of Markus Kliegl as described in his winner's interview: a Naive Bayes approach.

naive_bayes

What I had to do was find the place which maximizes the above probability. The brief steps of my approach are as follows:

  1. To narrow down candidate places for each record, I counted the check-ins of each place using a 100 x 200 grid; places with at least 2 check-ins in a cell were treated as candidates for that cell.
  2. Estimate p(place_id) from past trend of check-in frequency by xgboost regression (refer to Figure A).
  3. Estimate each distribution by using histogram (refer to Figure B).
  4. Calculate the product of probabilities of all candidates for each record.
  5. Select top 3 places which have high probability and create submission.

Regarding the second step, it seemed that the time variation of check-in frequency differed quite a lot between places, but I believed that some trend patterns exist. So I decided to predict the number of future check-ins of each place from the history of all places. The concept is shown in the figure below.

Concept

Figure A

Regarding the third step, I also illustrated the concept of estimation in the figure below.

Estimation

Figure B

There are some additional notes regarding the above figure:

  • After counting check-ins for each bin, the counted values are smoothed using neighboring bins.
  • For "x", "y", and "accuracy", the concept of time decay was introduced by multiplying by a weight when counting check-ins, in the following manner (see also the sketch after this list).
  • time_decay

  • For "y", it seemed that the distribution follows a normal distribution for many places, so I also estimated the center and standard deviation of each place and calculated the probability from that.
  • For "time of day", "day of week", and "accuracy", a small constant (pseudocount) is added after counting check-ins in order to avoid the probability becoming zero.

A simplified version of my solution is uploaded on my kernel here.

How did you spend your time on this competition?

The approach was almost fixed in the early stage of the competition, and I conducted trial and error many, many times to maximize the local validation score. I spent much time optimizing the hyperparameters used to estimate the distributions, such as:

  • The number of bins for each variable
  • How to smooth the distribution (number of neighbors and weight)
  • How to decay the counting weight according to time elapse
  • What pseudocount should be added

At the end of the competition, however, I couldn't improve the score by changing the above parameters, so I shifted to improving the precision of p(place_id).

Which tools did you use?

I always use R for Kaggle competitions just because I am familiar with it. However, I would like to master Python too in the future.

What was the run time for both training and prediction of your winning solution?

My solution takes only about 2 hours on a laptop with 8GB of RAM, and maybe this is the most prominent feature of my approach. Some other competitors seemed to apply xgboost to many small cells, which took a very long time, but that would have been quite unbearable to me! I decided to focus on achieving both accuracy and speed. Thanks to that, I could run trial and error many times.

Words of wisdom

What have you taken away from this competition?

Through this great competition, I gained a little more confidence in this field, but at the same time, I was surprised by other Kagglers' ideas which I would never have come up with. From this experience, I realized there are still many things I have to learn. So I would like to keep learning.

Do you have any advice for those just getting started in data science?

I am still a beginner in data science, but if there is one thing to say, it is "What one likes, one will do well." For me, it is Kaggle. I always enjoy competing and communicating with other Kagglers!

Bio

Ryuji Sakata works at the Panasonic group as a data scientist. He has been involved in data science for about 3 years. He holds a master's degree in Aeronautics and Astronautics from Kyoto University.

Avito Duplicate Ads Detection, Winners' Interview: 1st Place Team, Devil Team | Stanislav Semenov & Dmitrii Tsybulevskii


The Avito Duplicate Ads Detection competition ran from May to July 2016. This competition, a feature engineer's dream, challenged Kagglers to accurately detect duplicitous duplicate ads which included 10 million images and Russian language text. In this winners' interview, Stanislav Semenov and Dmitrii Tsybulevskii describe how their single XGBoost model scores among the top three and their ensemble snagged them first place. Stanislav's third Avito competition was a special one, too; his first place win as part of Devil Team boosted him to #1 Kaggler status!

The basics:

What was your background prior to entering this challenge?

Dmitrii Tsybulevskii: I hold a degree in Applied Mathematics, and I’ve worked as a software engineer on computer vision and machine learning projects.

Stanislav Semenov: I hold a Master's degree in Computer Science. I've worked as a data science consultant, teacher of machine learning classes, and quantitative researcher.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

Dmitrii Tsybulevskii: Yes, I’ve worked on image duplicate detection and text classification problems before, and I know the Russian language.

Stanislav Semenov: This is my 3rd Avito competition on Kaggle! And yes, I know the Russian language, too.

What made you decide to enter this competition?

Dmitrii Tsybulevskii: A lot of raw data, both text and images - a large field for feature engineering, and I like feature engineering.

Stanislav Semenov: A large area for feature engineering.

Let’s get technical:

What preprocessing and supervised learning methods did you use?

It was all about feature engineering. So we tried to generate as many strong features as we could. XGBoost was the only learning method used. Our single XGBoost model could get into the top three! Our final model just averaged XGBoost models with different random seeds.

We used the following text preprocessing:

  • stemming
  • lemmatization
  • transliteration

Our features:

  • different similarity features between title-title, title-description, and title-JSON, such as cosine distance, Levenshtein, Jaccard, NCD, etc.
  • different features for exact matches of words in the title and description
  • general features such as prices, places, number of images, exact match of title or description, etc.
  • different similarity features from trained w2v models
  • LSI features of the ads' union and XOR
  • one-hot encoding of categoryID
  • ratios of the title, description and JSON lengths
  • distances between BRIEF image descriptors
  • distances between color histograms in LAB space, and between HOG histograms
  • distances between features extracted with the pretrained MXNet BN-Inception-21k neural network, and the first averaged PCA components of these features
  • number of matches computed with the AKAZE local visual feature detector & descriptor

The most important trick was to submit our best result 2 hours before the end of the competition. That was EXTREMELY fun! =)

Did knowing Russian help you in this competition? If so, how?

Stanislav Semenov: Not so much. Of course, you can see where your model is wrong and take a close look at the ads. But it did not give any new information.

Dmitrii Tsybulevskii: On the one hand it was comfortable to work with Russian texts because you know what the ads are about. On the other hand, we had no killer features based on it.

Which tools did you use?

Jupyter Notebook, XGBoost, Pandas, scikit-learn, VLFeat, OpenCV, MXNet

What was the run time for your winning solution?

Feature extraction: 3-4 days
Model training: 1-2 weeks

Words of wisdom:

What have you taken away from this competition?

Dmitrii Tsybulevskii: I have learned about NCD distance and some convenient things about team cooperation.

Stanislav Semenov: A lot of fun and much needed ranking points. 😉

Do you have any advice for those just getting started in data science?

Stanislav Semenov: Solving practical problems is your best friend.

Dmitrii Tsybulevskii: Kaggle is a great platform for getting new knowledge.

Bio

Stanislav Semenov is a Data Scientist and Quantitative Researcher.

Dmitrii Tsybulevskii is a software engineer. He holds a degree in Applied Mathematics. His main interests are computer vision and machine learning.

Avito Duplicate Ads Detection, Winners' Interview: 2nd Place, Team TheQuants | Mikel, Peter, Marios, & Sonny


The Avito Duplicate Ads competition ran on Kaggle from May to July 2016. Over 600 competitors worked to feature engineer their way to the top of the leaderboard by identifying duplicate ads based on their contents: Russian language text and images. TheQuants, made up of Kagglers Mikel, Peter, Marios, & Sonny, came in second place by generating features independently and combining their work into a powerful solution.

In this interview, they describe the many features they used (including text and images, location, price, JSON attributes, and clustered rows) as well as those that ended up in the "feature graveyard." In the end, a total of 587 features were inputs to 14 models which were ensembled through the weighted rank average of random forest and XGBoost models. Read on to learn how they cleverly explored and defined their feature space to carefully avoid overfitting in this challenge.

The basics

What was your background prior to entering this challenge?

Mikel Bober-Irizar: Past Predictive modelling competitions, financial predictions and medical diagnosis.

Peter Borrmann: Ph.D. in theoretical physics, research assistant professor as well as previous Kaggle experiences.

Marios Michailidis: I am a part-time PhD student at UCL, data science manager at dunnhumby, and a fervent Kaggler.

Sonny Laskar: I am an Analytics Consulting Manager with Microland working on implementing Big Data Solutions; mostly dealing with IT Operations data.

How did you get started with Kaggle?

Mikel Bober-Irizar: I wanted to learn about machine learning and use that knowledge to compete in competitions.

Peter Borrmann: I wanted to improve my skillset in the field.

Marios Michailidis: I wanted a new challenge and learn from the best.

Sonny Laskar: I got to know about Kaggle a few years back when I was pursuing my MBA.

The Quants Team.

Summary

Our approach to this competition was divided into several parts:

  1. Merging early based on our leaderboard standings.
  2. Generating features independently (on cleaned or raw data) that would potentially capture the similarity between the contents of two ads; these could be further divided into categories (such as text similarities or image similarities).
  3. Building a number of different classifiers and regressors independently with a hold-out sample.
  4. Combining all members' work.
  5. Ensembling the results through a weighted rank average of a 2-layer meta-model network (StackNet).

Approach summary.

Data Cleaning and Feature Engineering

Data Cleaning

In order to clean the text, we applied stemming using the NLTK Snowball Stemmer, removed stopwords and punctuation, and transformed everything to lowercase. In some situations we also removed non-alphanumeric characters.

Feature Engineering vol 1: Features we actually used

In order to pre-emptively find over-fitting features, we built a script that looks at the changes in the properties (histograms and split purity) of a feature over time, which allowed us to quickly (200ms/feature) identify overfitting features without having to run overnight XGBoost jobs.
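
As a rough illustration of this idea, the sketch below compares a feature's histogram across time-ordered chunks of the training data and flags features whose distribution drifts heavily. It is a minimal sketch under assumptions, not the team's actual script; the chunking, bin count, and drift threshold are illustrative.

```python
import numpy as np

def drift_score(feature, n_chunks=5, bins=20):
    """Compare a feature's histogram across time-ordered chunks of the data.
    Large drift suggests the feature keys on specific products/sellers that
    may not appear in the test period."""
    feature = np.asarray(feature, dtype=float)
    edges = np.histogram_bin_edges(feature[~np.isnan(feature)], bins=bins)
    chunks = np.array_split(feature, n_chunks)
    hists = []
    for chunk in chunks:
        h, _ = np.histogram(chunk[~np.isnan(chunk)], bins=edges)
        hists.append(h / max(h.sum(), 1))
    # mean total-variation distance between consecutive chunk histograms
    tv = [0.5 * np.abs(hists[i + 1] - hists[i]).sum() for i in range(n_chunks - 1)]
    return float(np.mean(tv))

# e.g. drop features whose distribution shifts strongly over time
# drifty = [c for c in features.columns if drift_score(features[c]) > 0.3]
```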

After removing overfitting features, our final feature space had 587 features derived from different themes:

General:

  • CategoryID and parentCategoryID raw; CategoryID and parentCategoryID one-hot encoded (except overfitting ones).
  • Price difference / mean.
  • Generation3probability (output from model trained to detect generationmethod=3).

Location:

  • LocationID & RegionID raw.
  • Total latitude/longitude.
  • SameMetro, samelocation, same region etc.
  • Distance from city centres (Kaliningrad, Moscow, Petersburg, Krasnodar, Makhachkala, Murmansk, Perm, Omsk, Khabarovsk, Kluichi, Norilsk)
    Gaussian noise was added to the location features to prevent overfitting to specific locations, whilst allowing XGBoost to create its own regions.

All Text:

  • Length / difference in length.
  • nGrams Features (n = 1,2,3) for title and description (Both Words and Characters).
    • Count of Ngrams (#, Sum, Diff, Max, Min).
    • Length / difference in length.
    • Count of Unique Ngrams.
    • Ratio of Intersect Ngrams.
    • Ratio of Unique Intersect Ngrams.
  • Distance Features between the titles and descriptions:
  • Special Character Counting & Ratio Features:
    • Counting & Ratio features of Capital Letters in title and description.
    • Counting & Ratio features of Special Letters (digits, punctuations, etc.) in title and description.
  • Similarity between sets of words/characters.
  • Fuzzywuzzy distances.
  • jellyfish distances.
  • Number of overlapping sets of n words (n=1,2,3).
  • Matching moving windows of strings.
  • Cross-matching columns (eg. title1 with description2).

Bag of words:

For each of the text columns, we created a bag of words for both the intersection of words and the difference in words and encoded these in a sparse format resulting in ~80,000 columns each. We then used this to build Naive Bayes, SGD and similar models to be used as features.
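
A minimal sketch of that step is below, assuming `pairs` is a list of (title_1, title_2) strings and `y` the duplicate labels; the tokenisation, vectorizer settings, and use of out-of-fold Naive Bayes probabilities as the meta-feature are illustrative assumptions rather than the team's exact setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_predict

def word_sets(a, b):
    sa, sb = set(a.split()), set(b.split())
    # words the two titles share, and words that appear in only one of them
    return " ".join(sorted(sa & sb)), " ".join(sorted(sa ^ sb))

inter_docs, diff_docs = zip(*(word_sets(t1, t2) for t1, t2 in pairs))

vec = CountVectorizer(min_df=2)                  # sparse bag of words
X_inter = vec.fit_transform(list(inter_docs))    # same idea for diff_docs

# out-of-fold Naive Bayes probabilities, used downstream as a single feature
nb_feature = cross_val_predict(MultinomialNB(), X_inter, y, cv=5,
                               method="predict_proba")[:, 1]
```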

Price Features:

  • Price Ratio.
  • Is both/one price NaN.
  • Total Price.

JSON Features:

  • Attribute Counting Features.
    • Sum, diff, max, min.
  • Count of Common Attributes Names.
  • Count of Common Attributes Values.
  • Weights of Evidence on keys/values, XGBoost model on sparse encoded attributes.

Image Features:

  • # of Images in each Set.
  • Difference Hashing of images.
  • Hamming distance between each pair of images.
  • Pairwise comparison of file size of each image.
  • Pairwise comparison of dimension of each image.
  • BRISK keypoint/descriptor matching.
  • Image histogram comparisons.
  • Dominant colour analysis.
  • Uniqueness of images (how many other items have the same images).
  • Difference in number of images.

Clusters:

We found clusters of rows by grouping rows which contain the same items (e.g. if row1 has items 123, 456 and row2 has items 456, 789, they are in the same cluster). We discovered that the size of these clusters was a very good feature (larger clusters were more likely to be non-duplicates), as was the fact that all rows in a cluster always had the same generationMethod. Adding cluster-size features gave us a 0.003 to 0.004 improvement.
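
One way to build such clusters is to treat item pairs as graph edges and take connected components; the sketch below assumes a DataFrame `df` with `itemID_1`/`itemID_2` columns and uses networkx for the components, which may differ from the team's own implementation.

```python
import networkx as nx

# each row links two items; rows that share an item end up in one component
g = nx.Graph()
g.add_edges_from(zip(df["itemID_1"], df["itemID_2"]))

item_cluster = {}
for cluster_id, component in enumerate(nx.connected_components(g)):
    for item in component:
        item_cluster[item] = cluster_id

df["cluster_id"] = df["itemID_1"].map(item_cluster)
df["cluster_size"] = df.groupby("cluster_id")["cluster_id"].transform("size")
```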

Feature Engineering vol 2 : The ones that did not make it

Overfitting was probably the biggest problem throughout the competition, and lots of features which (over)performed in validation didn't do so well on the leaderboard. This is likely because the very powerful features learn to recognise specific products or sellers that do not appear in the test set. Hence, a feature graveyard was a necessary evil.

TF-IDF:

This was something we tried very early into the competition, adapting our code from the Home Depot competition. Unfortunately, it overfitted very strongly, netting us 0.98 val-auc and only 0.89 on LB. We tried adding noise, reducing complexity, but in the end we gave up.

Word2vec:

We tried both training a model on our cleaned data and using the pretrained model posted in the forums. We tried using word-mover distance from our model as features, but they were rather weak (0.70AUC) so in the end we decided to drop these for simplicity. Using the pre-trained model did not help, as the authors used MyStem for stemming (which is not open-source) so we could not replicate their data cleaning. After doing some transformations on the pre-trained model to try and make it work with our stemming (we got it down to about 20% missing words), it scored the same as our custom word2vec model.

Advanced cluster features:

We tried to expand the gain from our cluster features in several ways. We found that taking the mean prediction for the cluster as well as cluster_size * (1-cluster_mean) provided excellent features in validation (50% of gain in xgb importance), however these overfitted. We also tried taking features such as the standard deviation of locations of items in a cluster, but these overfitted too.

Grammar features:

We tried building features to fingerprint different types of sellers, such as usage of capital letters, special characters, newlines, punctuation etc. However while these helped a lot in CV, they overfitted on the leaderboard.

Brand violations:

We built some features based around words that could never appear together in duplicate listings. (For example, if one item wrote 'iPhone 4s' but the other one wrote 'iPhone 5s', they could not be duplicates). While they worked well at finding non-duplicates, there were just too few cases where these violations occurred to make a difference to the score.

Validation Scheme

Initially, we were using a random validation set before switching to a set of non-overlapping items, where none of the items in the valset appeared in the train set. This performed somewhat better; however, we had failed to notice that the training set was ordered based on time! We later noticed this (inspired by this post) and switched to using the last 33% as a valset.

Validation scheme.

This set correlated relatively well with the leaderboard until the last week, when we were doing meta-modelling and it fell apart - at a point where it would be too much work to switch to a better set. This hurt us a lot towards the end of the competition.

Modelling

Modelling vol 1 : The ones that made it

In this section we built various models (classifiers and regressors) on different input data each time (since the modelling process overlapped with the feature engineering process). All models were trained on the first 67% of the training data and validated on the remaining 33%. All predictions were saved (so that they could be used later for meta-modelling). The most dominant models were:

XGBoost:

Trained with all 587 of our final features with 1000 estimators, maximum depth equal to 20, minimum child weight of 10, and a particularly high eta (0.1) - bagged 5 times. We also replaced NaN values with -1 and Infinity values with 99999.99. It scored 0.95143 on the private leaderboard. Bagging added approximately 0.00030.
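
A minimal sketch of such a setup with the scikit-learn wrapper is below; `X_train`, `y_train`, and `X_test` are assumed to exist as pandas objects, and anything not stated in the description above (e.g. the exact seeding of the bags) is an assumption.

```python
import numpy as np
import xgboost as xgb

def clean(df):
    # as described: infinities -> 99999.99, NaN -> -1
    return df.replace([np.inf, -np.inf], 99999.99).fillna(-1)

X_tr, X_te = clean(X_train), clean(X_test)

models = []
for seed in range(5):                        # bagged 5 times with different seeds
    clf = xgb.XGBClassifier(
        n_estimators=1000, max_depth=20, min_child_weight=10,
        learning_rate=0.1,                   # the "particularly high" eta
        random_state=seed,
    )
    models.append(clf.fit(X_tr, y_train))

pred = np.mean([m.predict_proba(X_te)[:, 1] for m in models], axis=0)
```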

Keras: Neural Network

Trained with all our final features, transformed with a standard scaler as well as with log(1 + x), where all negative features were first replaced with zero. The main architecture involved 3 hidden layers of 800 units each with 60% dropout. The output activation was softmax and all intermediate activations were standard rectifiers (ReLU). We bagged it 10 times. It scored 0.94912 on the private leaderboard and gave roughly +0.00080-0.00090 when rank-averaged with the XGBoost model.
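
A sketch of that architecture in Keras is below; the optimizer, the two-class softmax output, and the input handling are assumptions, since only the layer sizes, dropout, and activations are stated above.

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_model(n_features):
    model = Sequential([
        Dense(800, activation="relu", input_shape=(n_features,)),
        Dropout(0.6),
        Dense(800, activation="relu"),
        Dropout(0.6),
        Dense(800, activation="relu"),
        Dropout(0.6),
        Dense(2, activation="softmax"),   # duplicate / not duplicate
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```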

Modelling vol 2: The ones that didn't

We built a couple of deeper XGBoost models with a higher eta (0.2) which, although they performed well in CV, overfitted the leaderboard.

We used a couple of models to predict the generation method in order to use it as a feature for meta-modelling, but it did not add anything so we removed it.

Meta-Modelling

The previous modelling process generated 14 different models, including linear models as well as XGBoosts and NNs, which were later used for meta-modelling.

For validation purposes we split the remaining (33%) data again into 67-33 in order to tune the hyperparameters of our meta-models, which used the aforementioned 14 models as input. Scikit-learn's random forest performed slightly better than XGBoost (0.95290 vs 0.95286). Their rank average yielded our best leaderboard score of 0.95294.
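
The rank average itself is simple; a sketch follows, where `rf_pred` and `xgb_pred` are the two meta-models' test predictions and the equal weighting is an assumption.

```python
import numpy as np
from scipy.stats import rankdata

def rank_average(preds):
    """Average the rank-transformed predictions of several models."""
    ranks = [rankdata(p) / len(p) for p in preds]
    return np.mean(ranks, axis=0)

final_pred = rank_average([rf_pred, xgb_pred])
```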

The Modelling and Meta-Modelling process is also illustrated below :

The Quants ensemble.

Thanks

Thanks to the competitors for the challenge, Kaggle for hosting, Avito for organizing. Thanks to the open source community and the research that makes it all possible.

Teamwork

How did your team form?

Early on into the competition Peter & Sonny & Mikel formed a team as they held the top 3 spots at the time, and decided to join forces to see how far they could go. Later on, Marios was spotted lurking at the bottom of the leaderboard, and was asked to join because of his extensive Kaggle experience.

How did your team work together?

We were all quite independent, branching out and each working on our own features as there was lots of ground to cover, while also brainstorming and discussing ideas together. At the end we came together to consolidate everything into one featurespace and to build models for it.

Bios

Mikel Bober-Irizar
Mikel Bober-Irizar (anokas) is a young and ambitious Data Scientist and Machine Learning Enthusiast. He has been participating in various predictive modelling competitions, and has also developed algorithms for various problems, including financial prediction and medical diagnosis. Mikel is currently finishing his studies at Royal Grammar School, Guildford, UK, and plans to go on to study Math or Computer Science.

Peter Borrmann
Priv.-Doz. Dr. Peter Borrmann (NoName) is head of The Quants Consulting focusing on quantitative modelling and strategy. Peter studied in Göttingen, Oldenburg and Bremen and has a Ph.D. in theoretical physics. He habilitated at the University of Oldenburg where he worked six years as a research assistant professor. Before starting his own company Peter worked at IBM Business Consulting Services in different roles.

Marios Michailidis
Marios Michailidis (KazAnova) is Manager of Data Science at dunnhumby and a part-time PhD student in machine learning at University College London (UCL) with a focus on improving recommender systems. He has worked in both the marketing and credit sectors in the UK market and has led many analytics projects with various themes including acquisition, retention, uplift, fraud detection, portfolio optimization and more. In his spare time he has created KazAnova, a GUI for credit scoring made 100% in Java. He is a former Kaggle #1.

Sonny Laskar
Sonny Laskar (Sonny Laskar) is an Analytics Consulting Manager at Microland (India) where he is building an IT Operations Analytics platform. He has over eight years of experience spread across IT infrastructure, cloud, and machine learning. He holds an MBA from India's premier B-school, IIM Indore. He is an avid break dancer and loves solving logic puzzles.

Draper Satellite Image Chronology: Pure ML Solution | Damien Soukhavong

draper-competition-damien-soukhavong

The Draper Satellite Image Chronology Competition (Chronos) ran on Kaggle from April to June 2016. This competition, which was novel in a number of ways, challenged Kagglers to put order to time and space. That is, given a dataset of satellite images taken over the span of five days, the 424 brave competitors were required to determine their correct order. The challenge, which Draper hosted in order to contribute to a deeper understanding of how to process and analyze images, was a first for Kaggle--it allowed hand annotation as long as processes used were replicable.

While the winners of the competition used a mixture of machine learning, human intuition, and brute force, Damien Soukhavong (Laurae), a Competitions and Discussion Expert on Kaggle, explains in this interview how factors like the limited number of training samples which deterred others from using pure machine learning methods appealed to him. Read how he ingeniously minimized overfit by testing how his XGBoost solution generalized on new image sets he created himself. Ultimately his efforts led to an impressive leap in 242 positions from the public to private leaderboard.

Draper satellite images

The basics

What was your background prior to entering this challenge?

I hold an MSc in Auditing, Management Accounting, and Information Systems. I have been self-taught in Data Science, Statistics, and Machine Learning since 2010. This autodidactism also helped me earn my MSc with distinction with a thesis on data visualization. I have worked with data for about 6 years now. Although data science can be technical, I feel I have more of a creative and design-oriented mind than a technical one!

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

I love manipulating images and writing code to process them, along with helping researchers produce reproducible research in the image manipulation field. It helped me look for convenient features that would at least generalize to unknown images unrelated to this competition.

How did you get started competing on Kaggle?

I found Kaggle randomly when a laboratory researcher asked me to look for online competitions involving data. Kaggle delighted me quickly, as it provides different competition datasets along with a convenient interface that is straightforward to use even for newcomers. There are even tutorial competitions, which are like fast introductions to data science and machine learning techniques to get you productive immediately.

What made you decide to enter this competition?

Three reasons that may look like disadvantages made me enter this competition:

  • It is an image-based competition, requiring preprocessing of images, along with the selection of features which are not apparent at first sight.
  • It is about ordering images in time with a tiny dataset. Hence, one would not throw a Deep Learning model out of the box and expect it to work.
  • Overfitting is an issue "thanks to" the 70 training sets provided: I personally like leakage and overfitting issues, as fighting them is like avoiding the mine in a minefield (think: Minesweeper).

Let's get technical

What preprocessing and supervised learning methods did you use?

A visual overview of my methods is in the following picture:

A visual overview of the methods used.

For the preprocessing method, I started by registering and masking each set of images, so that they are aligned and contain only the information common to all images of the same set. Then, I used a feature extraction technique to harvest general descriptive features from each image. I generated all the permutations of each set of five images along with their theoretical performance metric, raising the training sample size to 8,400 (120 permutations per set, 70 sets) and reducing the overfitting issue to a very residual one. This also turned the (5-class) ranking problem into a regression problem.
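
A minimal sketch of the permutation step is below; `set_ids` is an assumed list of training set identifiers, and the Spearman correlation is only a placeholder for the "theoretical performance metric" attached to each ordering.

```python
from itertools import permutations
from scipy.stats import spearmanr

true_order = (1, 2, 3, 4, 5)
rows = []
for set_id in set_ids:                          # the 70 training sets
    for perm in permutations(true_order):       # 120 orderings per set
        # placeholder target: how well this ordering agrees with the truth
        target, _ = spearmanr(perm, true_order)
        rows.append({"set_id": set_id, "order": perm, "target": target})
# 120 permutations x 70 sets = 8,400 training rows for a regression model
```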


For the supervised learning method, I used Extreme Gradient Boosting (XGBoost) with a custom objective and custom evaluation function.
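
The author's pipeline was in R; purely as an illustration of how a custom objective and evaluation function plug into XGBoost, here is a Python sketch with a simple squared-error objective standing in for his actual (unpublished) functions, with `X_train`/`y_train` assumed to exist.

```python
import numpy as np
import xgboost as xgb

def custom_objective(preds, dtrain):
    # illustrative squared-error gradient/hessian, not the author's objective
    labels = dtrain.get_label()
    grad = preds - labels
    hess = np.ones_like(preds)
    return grad, hess

def custom_eval(preds, dtrain):
    labels = dtrain.get_label()
    return "rmse_like", float(np.sqrt(np.mean((preds - labels) ** 2)))

dtrain = xgb.DMatrix(X_train, label=y_train)
booster = xgb.train({"max_depth": 6, "eta": 0.1}, dtrain,
                    num_boost_round=200,
                    obj=custom_objective, feval=custom_eval)
```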

What was your most important insight into the data?

Hand labeling images was easy once you trained yourself to recognize the related features (think: the neural network in your brain), and I could go for a (near) perfect score if I wanted to. However, my interest was to use pure machine learning techniques, which are generalizing on unknown samples. Thus, I did not explore the manual way for too long.

A simple example for hand-labeling using objects:

Simple example for hand-labeling.

I made a tutorial for recognizing and quantifying the appearance and removal of objects:

A tutorial shared on the competition forums.

Were you surprised by any of your findings?

I came across several findings that surprised me:

  • Leave-one-out cross-validation (LOOCV) might be a good method for validating a supervised machine learning model on a tiny training set; however, some image sets leak into others, which makes the cross-validation invalid right from the beginning!
  • Using the file size of pictures was leaking information out of the box… I believe one can get 0.30 (or more) using only the image file sizes with a simple predictive model.
  • Working with JPEG (1GB) pictures instead of TIFF (32GB) worked better than expected (I was expecting around 0.20 only).
  • Using a scoring model built from the three highest-scoring predictions gave a slight boost on the performance metric. However, using predicted scores from all 120 permuted orderings per set to create the scoring model gave worse results than random predictions (overfitting).

This shows the model used can generalize to Draper's test sets. However, one must test the model in a real situation. Does it generalize? To test this hypothesis, I took two different image sets:

  • 20 random locations in Google Earth where the satellite images were crystal clear, and day-to-day (100 pictures, 20 sets in total)
  • 250 pictures from different areas I took myself when commuting during workdays (250 pictures, 50 sets)

On Google Earth, computing the performance metric from my predictions gave a score of 0.112, which is better than pure randomness. There is a slight bias towards a good prediction, though. Having access to the predictions and to the right order of the pictures, I assumed a uniform distribution of the predictions and ran a Monte-Carlo simulation over it. After 100,000 trials, here are the results:

Results of the Monte-Carlo simulation.

I will not say it is bad, but it is not that good either. The mean is 0.09, while the standard deviation is 0.07. Assuming a normal distribution, we have about a 90% chance of predicting better than random numbers. Reaching 0.442 (the maximum after 100,000 random tries) is clearly not straightforward.
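
A sketch of such a simulation is below; the mean Spearman correlation is only a stand-in for the actual competition metric, and the 20 sets of 5 images mirror the Google Earth experiment described above.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
true_orders = [np.arange(5) for _ in range(20)]     # 20 sets of 5 images

def score(pred_orders, true_orders):
    # placeholder metric: mean rank correlation per set
    return np.mean([spearmanr(p, t)[0] for p, t in zip(pred_orders, true_orders)])

trials = np.array([
    score([rng.permutation(5) for _ in true_orders], true_orders)
    for _ in range(100_000)
])
print(trials.mean(), trials.std(), (trials < 0.112).mean())  # share below the observed score
```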

Now coming to my personal pictures, I got... only 0.048, which is disappointing. To make sure this was not a mistake, I ran the same Monte-Carlo simulation I used previously, but on the new predictions:

Results of the second Monte-Carlo simulation.

Clearly, having a 64% chance of predicting better than random is on the low end. The mean is 0.02, and the standard deviation is 0.07. With more samples to train the model on, overfitting should be less of an issue than it is on this non-aerial imagery.

Which tools did you use?

I used macros in ImageJ for the feature extraction (more specifically the Fiji distribution, a supercharged version of ImageJ), and R + XGBoost for the final preprocessing and supervised learning. When using XGBoost, I used my custom objective and evaluation functions not only to get the global performance metric, but also to account for the ranking of predictions within a set (for voting on which picture order is the most probable).

XGBoost was in version 0.47. The current XGBoost version 0.60 refuses to run my objective and evaluation functions properly.

How did you spend your time on this competition?

I spent about 20% of the time reading the threads on the forum, as they are a wealth of information you may not find yourself. 60% of the time was spent on preprocessing and feature engineering, and the 20% left on predictive modeling, validating, and submitting.

I also tested different models and features after my first idea (ironically, it consumed over 1 day):

  • Deep Learning gives no result, as it predicts random numbers and never seems to converge even after 500 epochs (I tried data augmentation, image manipulation, and many architectures such as CaffeNet, GoogLeNet, VGG-16, ResNet-19, ResNet-50…).
  • Neural networks on the file sizes… are incredible as they abuse leakage. They will not generalize on a real scenario.
  • Random Forests are… overfitting severely and hard to control. They do overfit with enough noise, and this dataset is a good example of that issue (in pictures there can be… clouds!).
  • Data Visualization using Tableau, to look for the best interactions between features. It is clearly hard to notice the interactions, although XGBoost managed a great score on such a hard task.

What was the run time for both training and prediction of your solution?

Early on in the competition, I quickly set up a macro in ImageJ to extract features and save them in a CSV file. It took about 1 hour, which allowed me to set up the final preprocessing and XGBoost code properly in R. Afterwards, it took only 30 minutes to make the code work properly, preprocess the data, train the model, and make my first submission. I spent 10 more minutes cross-validating using five targeted folds (categorizing pictures by theme) and making a new submission (that was slightly better than the former).

In total, this took me about an hour and a half. Knowing my cross-validation method was correct, I did not care about seeing such a low score on the public leaderboard, which covered only 17% of the test samples (thanks to the forum threads). This ensured a jump of over 242 places from the public to the private leaderboard (ending 32nd), the former being only sample feedback on unknown data, and the latter being the one where we must maximize our performance (but which we cannot see until the competition ends).


Words of wisdom

The future of satellite imagery and technology will allow us to revisit locations with unprecedented frequency. Now that you've participated in this competition, what broader applications do you think there could be for images such as these?

There are many applications for satellite imagery and its associated technology. As a spatial reconstruction on a time dimension, one may:

  • Analyze the land usage over time
  • Plan the development of urban and rural areas
  • Manage resources and disasters in a better way (like analyzing the damage of a fire)
  • Work on a very large database usable for virtual reality (so you can feel how San Francisco looks from your own town for instance)
  • Initiate statistical studies
  • Regulate the environment (in a legal way)
  • Pinpoint tactical and safe areas for humans (moving hospitals, etc.)

Currently, there are satellites capable of recording things we cannot see as humans. One of the best known is Landsat, whose pictures can be used for finding minerals and many other resources (gas, oil...). Using data science and machine learning should give a hefty boost to predictive modeling from satellite imagery for businesses.

What have you taken away from this competition?

The benefits I have taken from this competition were:

  • Working with a tiny training set, as there were only 70 training sets.
  • Fighting overfitting, as learning the observations is easier than learning to generalize with such a tiny training set!
  • Rationally transforming the ranking problem into a regression problem, a skill that requires regular practice and smart ideas.

A minor benefit was using ImageJ not for pure research, but for a data science competition. I was not expecting to use it at all. I provided an example starter script for using ImageJ in this competition:

Code from a starter script for using ImageJ, shared as a competition kernel.

Do you have any advice for those just getting started in data science?

Several key pieces of advice for newcomers in data science:

  • I cannot say it enough times: efficiency is key.
  • Another important piece of advice: learn to tell stories (storytelling). Non-technical managers will not care about "but my XGBoost model had 92% accuracy!" when they will ask you the value to get from such model in their business environment. A good story is worth thousands of words in less than ten seconds!

More specific advice about predictive modeling for newcomers:

  • Knowing your features and understanding how to validate your models allows you to fight overfitting and underfitting appropriately.
  • Understanding the "business" behind what you are doing in data science when working with a dataset has high value: domain knowledge is a starting key to success.
  • Creating (only) (good) predictive models in a business environment does not mean you are a good data scientist: statistical and business knowledge must be learnt one way or another.
  • Thinking you can go the XGBoost way in any industry is naïve. Try to explain all the interactions caught between [insert 100+ feature names] in a 100+ tree XGBoost model.
  • Your Manager will not like seeing you lurk for improving your model accuracy from 55.01% to 55.02% during one week of work. Where is the value?

N.B.: Many ideas you may have can fail miserably. It is valuable to be brave and scrap what you did to start from scratch. Otherwise, you might stay stuck going the wrong way in a specific project, trying to push something you cannot push any further. Learning from mistakes is a key factor for self-improvement.

For pure supervised machine learning advice, three elements to always keep an eye on:

  • Choosing the appropriate way to validate a model depending on the dataset, and this comes with experience.
  • Looking up for potential leakage, as it may be what invalidates your model at the end when dealing with unknown samples.
  • Engineering features pushes your score higher than tuning hyperparameters in 99.99% of cases, unless you are using a horrible combination of hyperparameters right at the beginning.

Just for fun

If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?

I would run a data compression problem. It would open the way for finding novel methods to optimize the loss of information to a minimum, while decreasing the feature count to a bare minimum, using supervised methods.

What is your dream job?

Being a data science evangelist and managing talents!

Bio

Damien Soukhavong
Damien Soukhavong is a data science and artificial intelligence trainer. A graduate of an MSc in Auditing, Accounting Management, and Information Systems, he seeks to maximize the value of data in companies using data science and machine learning, while looking for efficiency in performance. He individually mentors creative professionals and designers who are investing time to bring data science into their daily creative/design work.

Draper Satellite Image Chronology: Pure ML Solution | Vicens Gaitan

draper_satellite_chronology_vicens_gaitan

Can you put order to space and time? This was the challenge posed to competitors of the Draper Satellite Image Chronology Competition (Chronos) which ran on Kaggle from April to June 2016. Over four-hundred Kagglers chose a path somewhere between man and machine to accurately determine the chronological order of satellite images taken over five day spans.

In collaboration with Kaggle, Draper designed the competition to stimulate the development of novel approaches to analyzing satellite imagery and other image-based datasets. In this spirit, competitors’ solutions were permitted to rely on hand annotations as long as their methodologies were replicable. And indeed the top-three winners used non-ML approaches.

In this interview, Vicens Gaitan, a Competitions Master, describes how re-assembling the arrow of time was an irresistible challenge given his background in high energy physics. Having chosen the machine learning only path, Vicens spent about half of his time on image processing, specifically registration, and his remaining efforts went into building his XGBoost model which landed him well within the top 10%.

The basics

What was your background prior to entering this challenge?

I’d like to define myself as a high-energy physicist, but this was in the 90’s ... ooh, I’m talking about the past century! At that time we used neural networks to classify events coming from the detectors, and NN were cool. Currently I’m managing a team of data scientists in a software engineering firm, and working actively on machine learning projects. Now NN are again cool ☺

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

In fact, no. It was a very good opportunity to learn about image processing.

How did you get started competing on Kaggle?

I discovered Kaggle through the Higgs Boson challenge. Immediately I realized that competing on Kaggle is the most efficient way to follow the state-of-the art in machine learning, and compare our own methodologies against the best practitioners in the world.

What made you decide to enter this competition?

The “time arrow” is still an open problem in science. It was interesting to see if a machine learning approach could say something interesting about it. Moreover, I have always been fascinated by satellite imagery. I could not resist the temptation to enter the challenge!

Let's get technical

What preprocessing and supervised learning methods did you use?

From the start, I realized that the first thing to be done was to "register" the images: scale and rotate the images in order to match them, in the same way we do when building photo panoramas.

Because there was no available image registration package for R, I decided to build the process from scratch. And this was a lucky decision, because during this process I generated some features that were very useful for the learning task.
The registration process is described in detail in this notebook:

The main idea in the “registration” process is to find “interesting points” that can be identified across images. Those will be the so called keypoints. A simple way to define keypoints is to look for “corners”, because a corner will continue to be a corner (approximately) if you rotate and scale the images.

Left: Image gradient square blurred. Right: Keypoints (local maxima from left image).

Once you have the key-points, it is necessary to describe them in an economical way using a few numbers, so that matches can be found efficiently. 30x30 rotated patches around the key-points, subsampled to 9x9, seem to work well for the granularity of these images.

Example of keypoint descriptors.

To do the matching, k-nearest neighbors on the descriptors comes into play, along with the magic of RANSAC to select the true matches over the noise.

In red, all the key-points. In green, the ones matching between images.

Once we have the matching points, it is easy to fit a homographic transformation between the images, and then compare them and look for temporal differences.
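
Vicens wrote this matching-and-fitting step from scratch in R; purely as an illustration of the same idea with off-the-shelf tools, a Python/OpenCV sketch might look like the following (the file names and the ORB detector are assumptions standing in for his custom corner keypoints and descriptors).

```python
import cv2
import numpy as np

img1 = cv2.imread("set107_day1.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical files
img2 = cv2.imread("set107_day2.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)        # stand-in for the custom corner detector
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# nearest-neighbour matching of descriptors
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(des1, des2)

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# RANSAC keeps the true matches (inliers) while fitting the homography
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=3.0)
n_inliers = int(inlier_mask.sum())
```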


Are you able to do that by hand? I'm not, so I took the machine learning path:

The number of available classified images is low, so it seemed more promising to build a model at the keypoint level (we have about one hundred keypoints per image) using the descriptor information, and then average the predictions for each image. It also seemed to be a good idea to predict temporal differences between keypoints, instead of directly predicting the position in the ordered sequence.

Let’s describe the procedure:

Feature engineering: (this was the easiest one, because it was already done after the registration process)

a) Keypoint detection: For each image, detect keypoints and generate descriptors: 30x30 (downsampled to 9x9) oriented patches for each keypoint.

b) Inter-set registering: For all sets, for every couple of images: identify common points using RANSAC and fit a homographic transformation. Keep the values of the transformation and the patches for the identified inliers. The parameters of the homographic transformation result in informative features when comparing couples of images.

c) Extra-set registering: Interestingly, the same idea can be used between images coming from different sets. Sampling from the full database of descriptors and using kNN, find candidate neighbor sets, and then register every possible image from one set to every image in the second set. Look for "almost perfect" matches, taking a combination of the number of identified common keypoints and the RMSE in the overlapping region as the matching value. This procedure automatically detects "neighbor sets".

d) Set clustering: Using the previous matching information, build a graph with sets as nodes and edges between neighbor sets, weighted proportionally to the matching value. It is relatively easy to select couples of images in neighbor sets with a "perfect match" (a high number of inliers and very low RMSE in the intersecting area). The connected components of this graph give the set clusters without the need to explicitly georeference the images by hand. This saves a lot of manual work.


PATCH LEVEL MODEL (Siamese gradient boosting)

The model is trained over pairs of common keypoints in images of the same set, trying to predict the number of days between one image and the other (with sign). The features used are:

Patch im1 (9x9), patch im2 (9x9), coefficients of the homographic transformation (h1..h9), number of inliers between both images, and RMSE in the overlapping region.

The model is trained with XGBoost using a reg:linear objective and 4-fold cross validation (ensuring that images in the same cluster belong to the same fold). Surprisingly, this model is able to discriminate the time arrow quite well.
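
A sketch of that cluster-respecting cross validation, using scikit-learn's GroupKFold with XGBoost's Python wrapper, is below; `X`, `y`, and `cluster_ids` are assumed inputs and the hyperparameters are placeholders (the original pipeline was written in R).

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GroupKFold

# groups = the set cluster each keypoint pair belongs to, so images from the
# same cluster never appear in both the training and validation folds
cv = GroupKFold(n_splits=4)
oof = np.zeros(len(X))
for tr_idx, va_idx in cv.split(X, y, groups=cluster_ids):
    model = xgb.XGBRegressor(objective="reg:squarederror",   # modern name for reg:linear
                             n_estimators=500)
    model.fit(X.iloc[tr_idx], y.iloc[tr_idx])
    oof[va_idx] = model.predict(X.iloc[va_idx])
```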


IMAGE LEVEL MODEL

Average the contribution of all patches in an image. Patches with an absolute value of DeltaT below a certain threshold (alpha) are discarded. If the patch-level model were "perfect", this would result in the average difference in days from one image to the rest of the images in the set, so the expected values for an ordered set would be -2.5, -1.5, 0, 1.5, 2.5 respectively. In practice, the ordering is calculated by sorting this average contribution.

Additionally, the average is refined by adding the average of images from overlapping sets (using the graph previously defined) taking into account the number of inliers between both images and the rmse in the intersection of them, weighted with a parameter beta, and iterating until convergence.

The optimal alpha and beta were adjusted using public leaderboard feedback (the training set is not representative enough of the test set) and, as expected, there was some overfitting and a drop of five positions on the private leaderboard.

The local CV obtained is 0.83 +/-0.05.

The model scored 0.76786 on the public leaderboard and 0.69209 on the private one. I'm convinced that with more training data, this methodology could get a score of .8x or even .9x.

What was your most important insight into the data?

Probably the fact that by comparing local keypoint descriptors it is possible to obtain some information about the deltaT between them - not only the number of days, but also the sign. Of course, this is a very weak model, but ensembling over all keypoints in an image, and taking into account the deltaT obtained from the neighbor sets (selecting the images with a "perfect match"), makes it much stronger.

Were you surprised by any of your findings?

Initially, I only used keypoints matched by RANSAC between couples of images. This implies that the descriptors are very similar. When I tried to add all keypoints, the result didn't improve. This tells us that the temporal information is coded in subtle variations of pixel intensities, not in big changes of macroscopic objects, like the presence or absence of a car, for example.

In fact, the bigger surprise is that these descriptors are able to code some information about the time arrow.

Which tools did you use?

I tried to develop the full pipeline in R. I used mainly the "imager" package for image processing and "xgboost" for the model. The "FNN" library for fast k-nearest neighbors was also very helpful because of its speed.

How did you spend your time on this competition? (For example: What proportion on feature engineering vs. machine learning?)

Half of the competition was devoted to developing the registration process. After that, building the model was relatively fast.

What was the run time for both training and prediction of your winning solution?

The keypoint extraction and descriptor calculation takes about 2 hours on a 12 core machine for the full dataset. Registration (especially the extra-set one) is more CPU consuming and lasts for more than 8 hours for the full dataset.

Then, model training can be done in a few hours (I'm doing 4-fold cross validation for selecting the hyperparameters).

Words of wisdom

The future of satellite imagery and technology will allow us to revisit locations with unprecedented frequency. Now that you've participated in this competition, what broader applications do you think there could be for images such as these?

Given the surprising fact that it is possible to correlate the time arrow with image information, why not try to predict changes in other indicators (changes over a time interval of several days) that may be weakly correlated with the images:

Predict risk of failure of infrastructures, demographics, social behavior, level of wealth, happiness ... I’m just dreaming.

What have you taken away from this competition?

Knowledge in a previously unknown area. The fact that XGBoost can be useful for image challenges (in case you doubted it). And, most important, a lot of fun.

Do you have any advice for those just getting started in data science?

The only way to learn is to try by yourself. Don't be afraid of problems in which you don't have a background. There are a lot of very interesting resources out there (forums, code...), but the only way is hands-on.

Just for fun

If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?

I think that the most interesting open problem is, given a dataset and a learner, to try to predict the optimal set of hyperparameters for that combination of dataset and model. I know, the problem is how to build a labeled set for this supervised problem... Maybe reinforcement learning could be a way to do that, but wait a moment... what would be the optimal hyperparameters for the policy learner?... It seems I'm entering a kind of Gödelian loop...
Better forget it 😉 Why not try to predict TV audiences based on time-stamp and EPG information?

What is your dream job?

I already have it 😉

Bio

Vicens Gaitan is R&D director in the Grupo AIA innovation area. He studied physics and received a PhD in Machine Learning applied to experimental High Energy Physics in 1993 with the ALEPH collaboration at CERN. Since 1992 he has worked at AIA on complex problem solving and algorithmic development applied to model estimation, simulation, forecasting and optimization, mainly in the energy, banking, telecommunication and retail sectors for big companies. He has in-depth knowledge of advanced methodologies including machine learning, numerical calculation, game theory and graph analysis, using lab environments like Python or R, or production code in C and Java.


Grupo Bimbo Inventory Demand, Winners' Interview: Clustifier & Alex & Andrey

Grupo Bimbo Inventory Demand Kaggle Competition

The Grupo Bimbo Inventory Demand competition ran on Kaggle from June through August 2016. Over 2000 players on nearly as many teams competed to accurately forecast sales of Grupo Bimbo's delicious bakery goods. Kaggler Alex Ryzhkov came in second place with his teammates Clustifier and Andrey Kiryasov. In this interview, Alex describes how he and his team spent 95% of their time feature engineering their way to the top of the leaderboard. Read how the team used pseudo-labeling, typically used in deep learning, to improve their final forecast.

The basics

What was your background prior to entering this challenge?

I graduated from Mathematical Methods of Forecasting department at Moscow State University in 2015. My scientific advisor was Alexander D’yakonov, who once was the Top-1 Kaggler worldwide, and I have learnt a lot of tips and tricks from him.

Kaggler Alex Ryzhkov.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

Of course I have 🙂 I participated in the first run of the PZAD course held by Alexander D'yakonov, where we developed our practical skills in machine learning competitions. Moreover, after each competition I spent several days reading the winning solutions and figuring out what I could have done better.

How did you get started competing on Kaggle?

Almost at the beginning of my education in the Mathematical Methods of Forecasting department in university I joined Kaggle and totally loved it.

What made you decide to enter this competition?

I enjoyed this competition in two ways. My passion is to work with time-series data and I have several qualification works on this type of data. The second reason is that I wanted to check how far I can go using Amazon AWS servers’ power.

Let's get technical

What preprocessing and supervised learning methods did you use?

For this competition we used several XGBoost, FTRL, and FFM models, and the initial dataset was hugely increased by:

  • different aggregations (mean, median, max, min, etc.) of the target and sales variables by week, product, client and town IDs (see the sketch after this list);
  • New_Client_ID feature (for example, all OXXO shops have the same ID in it instead of different ones in the dataset from Bimbo);
  • features from products' names like weight, brand, number of pieces, weight of each piece;
  • Truncated SVD on TF-IDF matrix of client and product names
  • etc.
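
A minimal sketch of the aggregation step with pandas is below; the column names follow the Bimbo schema but should be treated as assumptions, as should the choice of restricting the aggregation to earlier weeks so the features for a given week never see that week's own demand.

```python
import pandas as pd

df = pd.read_csv("train.csv")   # assumed columns: Semana, Producto_ID,
                                # Cliente_ID, Demanda_uni_equil, ...

# target aggregations by product and client, computed on earlier weeks only
history = df[df["Semana"] < 8]
agg = (history.groupby(["Producto_ID", "Cliente_ID"])["Demanda_uni_equil"]
              .agg(["mean", "median", "max", "min"])
              .add_prefix("demand_prod_client_")
              .reset_index())

train_week8 = df[df["Semana"] == 8].merge(
    agg, on=["Producto_ID", "Cliente_ID"], how="left")
```
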
Top 100 clients by sales contains 48 Walmart stores.

What was your most important insight into the data?

Since the public-private test split was done in a time-based manner (one week in public and the next week in private), we couldn't use features with a lag of 1 when training our models. We ran experiments to check this point, and models that use lag_1 features score 0.03-0.05 worse on private in logloss terms than models without these features.

Lag1 features - include or not include? Definitely not!

Were you surprised by any of your findings?

It was surprising that the initial client IDs worked as well as their clustered version. At the beginning of the competition I thought the initial ones had too much diversity, but for the final model we kept both of them in the dataset.

Which tools did you use?

For this competition we used XGBoost packages in Python and R, as well as a Python implementation of the FTRL algorithm and the FFM library for regression problems. To run heavy models on the whole dataset, spot Amazon r3.8xlarge servers were the best option - fast and with huge RAM.

How did you spend your time on this competition?

From my point of view, it was a feature engineering competition. After my first script with XGBoost, I spent all of my time on preprocessing client and products tables, working with towns and states, creating new aggregations on sales and target variables. So it was 95% of time for feature engineering and only 5% for machine learning.

What was the run time for both training and prediction of your winning solution?

If we run it on r3.8xlarge, it will take around 146 hours (6 days) including feature engineering, training and predicting steps.

Our final ensemble with the duration of each step.

Words of wisdom

What have you taken away from this competition?

It was really surprising that pseudo labeling techniques can work outside deep learning competitions. Also you should spend a lot of time thinking about your validation and prediction techniques - it can prevent you from losing your position in the end.

Do you have any advice for those just getting started in data science?

From my side, competitions with kernels enabled are the best teachers for beginners. You can find all sorts of scripts there - from simple ones (like all zeros or a random forest on the whole initial dataset) to advanced ones (blends of several models with preprocessing and feature engineering). It's also useful to read topics on the forum - you can get a number of ideas from other competitors' posts. The last piece of advice, but in my opinion the best one - don't give up!

Teamwork

How did your team form?

I was in the top 20 when I got stuck and realized I needed new views and ideas to finish in the top 10 on private - at that time I merged with Clustifier and we started working together. Later we joined with Andrey to stay competitive with another top team - The Slippery Appraisals.

How did your team work together?

We had a chat in Skype (later in Google Hangouts) where we could discuss our ideas. Beyond that, all data was shared on Google Drive and we uploaded our first-level submissions there. Moreover, I also shared my RStudio server on AWS with Clustifier, so we could easily work on the same files simultaneously.

How did competing on a team help you succeed?

Firstly, merging about one to two weeks before the end of the competition increases your score. Secondly, you can exchange your ideas with teammates and each of them will implement those ideas in his own manner - this boosts your models even more. Finally, it's a nice way to share experience and tips & tricks, which help you to climb and improve the stability of your solution before the private LB.

Just for fun

If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?

It would be nice to create a small challenge with leaderboard shake up prediction. This topic is always popular on forums near the end of each competition.

What is your dream job?

Data scientist outside Russia 🙂

Bio

Alexander Ryzhkov graduated from the Mathematical Methods of Forecasting department at Moscow State University in 2015, where his scientific advisor was Alexander D'yakonov. Now he works as a software developer at the Deutsche Bank Technology Center (Moscow).

TalkingData Mobile User Demographics Competition, Winners' Interview: 3rd Place, Team utc(+1,-3) | Danijel & Matias

TalkingData Mobile User Demographics competition winners' interview

The TalkingData Mobile User Demographics competition ran on Kaggle from July to September 2016. Nearly two thousand players formed 1689 teams who competed to predict the gender and age group of mobile users based on their app usage, geolocation, and mobile device properties. In this interview, Kagglers Danijel Kivaranovic and Matias Thayer, whose team utc(+1,-3) came in third place, describe their winning approach using Keras for "bag of apps" features and XGBoost for count features. They explain how actively sharing their solutions and exchanging ideas in Kernels gave them a competitive edge.

The basics

What was your background prior to entering this challenge?

Danijel: I am a Master’s student in Statistics at the University of Vienna. Further, I worked as a statistical consultant at the Medical University of Vienna.

Matias: I’ve participated in previous Kaggle competitions and most of my relevant experience came from there. Also I work as an analyst now and previously I worked as DBA/developer. I started doing online courses about 2 years ago through EDX, Coursera and MIT professional education. Over there I got familiarized with machine learning, statistics and tools such as R and Python.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

Matias: Past competitions on Kaggle, and also familiarity with doing SQL-like manipulations on data.

Danijel: All prior experience I had comes from the Kaggle competitions I participated in. The datasets in medical research often have less than 100 observations and one is more interested in statistical inference than in black-box predictions. Of course, advanced machine learning tools are not even applicable to these small sample sizes.

How did you get started competing on Kaggle?

Danijel: I heard of Kaggle a few years ago but started my first competition last year (Rossmann Store Sales). Kaggle was a great opportunity to revise what I have learned, improve my programming skills and especially my machine learning knowledge.

Matias: My first competition was in October 2014. It was “15.071x - The Analytics Edge (Spring 2015)”. It was part of an MIT course through EDX. The competition was a lot of fun and I quickly got addicted to Kaggle competitions. I really like learning different viewpoints on hard problems and Kaggle is great for that.

What made you decide to enter this competition?

Matias: I liked the fact that the amount of data was relatively small which means you can do many experiments on a normal laptop. Also, I really like when the data is not “hidden” and you can think and try different hypotheses about it.

Danijel: The data had to fit in RAM. My PC only has 6GB RAM.

Let's get technical

What preprocessing and supervised learning methods did you use?

Danijel: Around 2/3 of the devices had no events, and the only information available for them was the phone brand, the device model, and a binary flag for whether a device had events or not. So, from the beginning, I started to train separate models for devices with and without events. I only used xgboost and keras for modelling.

I used completely different features for my xgboost models and my keras models which was especially beneficial for ensembling afterwards.

Three types of features were used for xgboost:

  1. count how often each category appears in the app list (all apps on the device)
  2. count how often each app appears in the event list
  3. count at which hour and at which weekday the events happened. And also median latitude and longitude of events.

As many other competitors I used the “bag of apps” features for my keras models.

Instead of directly optimizing the logloss, I also tried a 2-stage procedure where I used the gender feature as a meta feature in the second stage. The procedure is:

  1. Create the set of features
  2. Predict the probability of gender (Stage 1)
  3. Use gender as additional feature and predict the probability of age groups (Stage 2)
  4. Combine the predictions using the definition of conditional probability: P(A_i, F) = P(A_i | F) · P(F) and P(A_i, M) = P(A_i | M) · P(M) for i = 1, …, 6, where the A_i denote the age groups 1 to 6, and F and M denote female and male, respectively (see the sketch below).
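
A minimal sketch of that combination step is below; `gender_model`, `age_model`, and the dense feature matrix `X` are assumed to exist, the gender feature is assumed to be appended as the last column for the stage-2 model, and the ordering of the 12 output groups is illustrative.

```python
import numpy as np

# stage 1: probability of gender = F, shape (n,)
p_female = gender_model.predict_proba(X)[:, 1]

# stage 2: age-group probabilities conditioned on gender, shape (n, 6)
p_age_given_f = age_model.predict_proba(np.column_stack([X, np.ones(len(X))]))
p_age_given_m = age_model.predict_proba(np.column_stack([X, np.zeros(len(X))]))

# P(A_i, F) = P(A_i | F) * P(F)  and  P(A_i, M) = P(A_i | M) * P(M)
joint_f = p_age_given_f * p_female[:, None]
joint_m = p_age_given_m * (1 - p_female)[:, None]
final = np.hstack([joint_f, joint_m])   # the 12 gender-age groups
```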

This 2-Stage procedure significantly outperformed the standard approach for xgboost but was slightly worse for keras.

Matias: I started doing some analysis on the ratios of usage of different apps (in fact that analysis is here: Only Another Exploratory Analysis) and trying with bags of brands, models and labels.

Matias' kernel, Only another exploratory analysis. The distribution of devices with events across the groups is a bit different from the distribution without events.

Then I saw Yibo's script (XGBoost in R) and I copied his way of encoding everything to 1/0 (even the ratios). Then I started to use a bunch of xgb models as well as a glmnet model, blending them all. I was doing reasonably well (around 20th-30th place on the LB) when I saw dune_dweller's script. At that time I was trying to learn Keras, so I used her feature engineering and plugged in a keras model. It had great performance and boosted my score to 17th position!

I decided to share this Keras script in Kaggle just to get some feedback: Keras on Labels and Brands.

Matias' kernel, Keras on labels and brands, which borrows data manipulation from dune_dweller's kernel, A linear model on apps and labels.

And our best single model for devices with events is just that model with some new features, more layers, and regularization. It scored 2.23452 on the LB.

The additional features to this model were:

  • TF-IDF of brand and model (for devices without events)
  • TF-IDF of brand, model and labels (for devices with events)
  • Frequency of brands and model names (that one produced a small but clear improvement)

We merged teams with Danijel later in the competition, and he was doing something quite different. Together we started retraining some of our models on CV10 and CV5, bagging everything as much as our computers allowed. For the ensemble weights we used R's optim (iterated 3 times), and we also built different ensembles for devices with events and devices without events.

When the leak issue was raised we were around 11th to 13th on the LB, and we started to look for where it was. After realizing what it was by looking at the files in a simple spreadsheet, my teammate Danijel built a clever matching script that, combined with our best submission, allowed us to fight for the top places in those crazy last 3 days. We also found that devices with events weren't benefiting from the leak, so we only used leak-based models on devices without events.

What was your most important insight into the data?

Matias: I found it interesting how the keras model worked out with a high dimension sparse matrix. Also I was really surprised after I opened the train and test sets in a simple spreadsheet. I need to do that more often.

Danijel: As already mentioned, I used two different set of features (count features for xgboost and “bag of apps” features for keras) that performed differently depending on the learning algorithm.

Xgboost: The count features outperformed the “bag of apps” features.

Keras: The “bag of apps” features outperformed the count features. I tried to scale the count features (standard and minmax scaling) but they still could not keep up with the “bag of apps” features which are all one-hot-encoded.

What was the run time for both training and prediction of your winning solution?

Danijel: The best single model takes less than an hour, however, the final ensemble takes a day approximately.

Matias: My best single model (keras) takes about 12 hours in a standard laptop. Mainly because it was bagged 5 times. My other secondary models took between 2-8 hours to run (usually overnight).

Words of wisdom

What have you taken away from this competition?

Matias: At first I thought that sharing my scripts as kernels would make me weaker in terms of ranking, because everyone could see my ideas, but to the contrary my final rank wasn’t bad and it helped me a lot to validate things and get really valuable feedback from other users. Also, I added NNET with Keras to my coding library.

Danijel: I learned three things:

  1. It is a huge step from Top 10% results to Top 10.
  2. I need better hardware.
  3. How to install Keras on a Windows PC.

Teamwork

How did your team form?

Danijel: I contacted a few top competitors. Matias was the first who agreed to merge.

Matias: I received an invitation from Danijel and I accepted it.

Red Hat Business Value Competition, 1st Place Winner's Interview: Darius Barušauskas

The Red Hat Predicting Business Value competition ran on Kaggle from August to September 2016. Well over two thousand players competed on 2271 teams to accurately identify potential customers with the most business value based on their characteristics and activities. In this interview, Darius Barušauskas (AKA raddar) explains how he pursued and achieved his first solo gold medal with his 1st place finish. Now an accomplished Competitions Grandmaster after one year of competing on Kaggle, Darius shares his winning XGBoost solution plus his words of wisdom for aspiring data scientists.

The basics

What was your background prior to entering this challenge?

I have been on Kaggle for a year now and it has been a very exciting time of my life. ☺ In my years working in data analytics I have obtained many useful data mining and ML skills, which have flourished in the Kaggle competitions I've participated in.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

The problem itself was not new to me - I have built several client-potential detection models in my work. They were designed differently from Red Hat's problem, but that experience helped me make useful feature transformations in this competition.

What made you decide to enter this competition?

I aimed for a solo gold medal to achieve my Grandmaster title - it took me only a year! I am very happy that I decided to dedicate all my spare time to this and that I was able to make my goals come true: a top 10 overall rank, a nice win and a hefty reward. ☺

This competition was a tight race. How did you approach it differently from past competitions?

I have always preferred working in a team. As this was a dedicated solo run, there were times when it was hard to concentrate and easy to procrastinate - I had to look for moral support from my Kaggle friends. Thank you guys!

Let’s get technical

What was your most important insight into the data?

The presence of a leak transformed the original problem into 2 sub-problems, which I tackled simultaneously:

a) Interpolating outcome values for companies with some leakage information
b) Predicting outcome values for companies not affected by leakage

I chose to turn the leak into several features for my ML models, so they could directly predict the points in time where the outcome value changed - in contrast to many competitors who used ad hoc rules.

The data itself presented several ways to tackle the problem given Red Hat's client company-user-activity relation. I chose a top-down modelling approach: create robust company-level models first and incorporate them into activity-level models using the companies' user information.

The main principle of my company-level models was to take the first observation in time as a reference point for each company, then aggregate the activities having the same outcome value and build ML models on that subset of data (with similar model versions taking the last observation as the reference point). Having robust predictions of the first and last observations translated well into capturing if and when a company's value changed over time. These models were critical for my solution to work, so I dedicated 90% of my time to them.
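
A loose pandas sketch of this "first/last observation as reference point" idea is below; the column names (company_id, date, outcome) and the toy data are placeholders, not the actual competition schema or Darius' code.

```python
# Loose sketch of using the first and last observations per company as
# reference points. Column names and data are placeholders.
import pandas as pd

acts = pd.DataFrame({
    "company_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2022-01-01", "2022-01-05", "2022-02-01",
                            "2022-01-03", "2022-01-10"]),
    "outcome": [0, 0, 1, 1, 1],
})

acts = acts.sort_values(["company_id", "date"])
first = acts.groupby("company_id").first().rename(
    columns={"date": "first_date", "outcome": "first_outcome"})
last = acts.groupby("company_id").last().rename(
    columns={"date": "last_date", "outcome": "last_outcome"})

company_level = first.join(last)
# A company whose first and last outcomes differ changed value at some point
# in between; company-level models aim to predict if/when that happens.
company_level["value_changed"] = (
    company_level["first_outcome"] != company_level["last_outcome"])
print(company_level)
```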

What preprocessing and supervised learning methods did you use?

My solution had a simple 4+2 model structure: 4 company-level XGBoost models incorporated into 2 activity-level XGBoost models. The first activity-level model was CV-optimized (it had very poor public LB performance) and the other was selected for the best public LB score; combining these two strategies provided a huge score uplift in my final submission.

Other methods did not work as well as XGBoost. I did not want my solution to be complicated, given the presence of the leak, so I just stuck with XGBoost. Microsoft's brand-new LightGBM would have produced even better results, so if the competition had been a month or two later, I would probably have preferred LightGBM.

What was the run time for both training and prediction of your winning solution?

Due to the simplicity of the solution, it takes only a few hours on a 12-thread CPU in a RAM-friendly environment.

How did you use Kernels in this competition?

I produced my very first popular Kaggle kernel! I had not used sparse matrices before (surprise?) - seeing how easily these can be created and manipulated in R, I wanted to share that with everyone.

Darius' first popular kernel, 0.98 XGBoost on sparse matrix.
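
His kernel was written in R; a rough Python analogue of training XGBoost directly on a sparse matrix, on toy data, might look like the following.

```python
# Rough Python analogue of "XGBoost on a sparse matrix" (Darius' kernel was in
# R); the data here is a toy example, not the Red Hat dataset.
import numpy as np
import scipy.sparse as sp
import xgboost as xgb

# A small sparse one-hot style feature matrix.
X = sp.csr_matrix(np.array([[1, 0, 0, 1],
                            [0, 1, 0, 0],
                            [1, 0, 1, 0],
                            [0, 0, 1, 1]], dtype=np.float32))
y = np.array([1, 0, 1, 0])

dtrain = xgb.DMatrix(X, label=y)        # DMatrix accepts scipy sparse input
params = {"objective": "binary:logistic", "eval_metric": "auc",
          "max_depth": 4, "eta": 0.1}
model = xgb.train(params, dtrain, num_boost_round=50)
preds = model.predict(xgb.DMatrix(X))
```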

Words of wisdom

What have you taken away from this competition?

  • Leave no stone unturned when it comes to testing silly ideas.
  • A combination of cross-validation and a public-LB-overfitting approach can yield surprisingly good results. I did not expect that.
  • Competing solo at high ranks is very tough.

Do you have any advice for those just getting started in data science?

  • Try running simple Kaggle kernels written by others and try to understand what is going on. Asking questions and receiving answers is the fastest way to know how things are done.
  • Try to acquire technical skills first - try as many methods as you can, create your own code templates for running and making predictions on any given dataset.
  • Learn how to do proper cross-validation and understand why it is important.
  • Don’t let XGBoost be the only tool in your toolbox.

Just for fun

If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?

Given a list of personal names and login names, predict any risk-related event using only external public internet data, especially social networks.

What is your dream job?

Developing data science models to improve the quality of everyone’s daily life.

Bio

Darius Barušauskas holds a BSc and an MSc in Econometrics (Vilnius University, Lithuania). He specializes in credit and other risk modelling (5+ years of experience) and has created many different models for the financial, telco and utilities sectors. R and SQL guru.

Painter by Numbers Competition, 1st Place Winner's Interview: Nejc Ilenič

Does every painter leave a fingerprint? Accurately distinguishing the artwork of a master from a forgery can mean a difference in millions of dollars. In the Painter by Numbers playground competition hosted by Kiri Nichol (AKA small yellow duck), Kagglers were challenged to identify whether pairs of paintings were created by the same artist.

In this winner's interview, Nejc Ilenič takes us through his first place solution to this painter recognition challenge. His combination of unsupervised and supervised learning methods helped him achieve a final AUC of 0.9289. The greatest testament to his final model's performance? His model generally predicts greater similarity among authentic works of art by Johannes Vermeer compared to imitations by the fraudulent artist, Han van Meegeren.

The Basics

What was your background prior to entering this challenge?

I’m currently finishing my master’s degree in computer science at University of Ljubljana. I began learning about data science five months before entering this competition by taking a data mining course offered by my faculty.

What made you decide to enter this competition?

At the beginning of the course I remember being thrilled by the fact that one can predict digits from images by writing only a few lines of Python code (i.e. implementing logistic regression). I soon realized that this is something I want to do in life, so competing on Kaggle seemed like a reasonable next step to put my newly acquired skill set to the test. I chose this particular competition mostly because I find the domain the data originates from intriguing.

Let’s get technical

How did you tackle the problem and what methods did you use?

First I will briefly describe the dataset and the preprocessing methods I used, and after that I will describe how I built and validated a predictive model. The complete source code of the project along with a description of the approaches can be found in this GitHub repository.

The training set is unbalanced, some classes are present only in the training set and some only in the test set, and the input images are of various dimensions. There are 79433 instances and 1584 unique painters in the training set, and the test set is composed of 23817 instances. Predictions for approximately 22M pairs needed to be made for the submission.

The plot below shows the number of paintings for each of the 1584 painters in the training set.

Number of examples per class in the training set.

Labeled images were split into training (0.9) and validation (0.1) sets in a stratified manner, resulting in 71423 training examples and 8010 validation examples belonging to 1584 classes.

The model I’ve built assumes fixed-size inputs, so the first preprocessing step was to resize each image’s smallest dimension to 256 pixels (retaining the aspect ratio) and then cropping it at the center of the larger dimension, obtaining 256x256 images. Some information gets lost during this process and an alternative approach where multiple crops are taken from the same image was considered, but not used for the final solution due to much longer training times (bigger, but more correlated training set). Furthermore, mean values were subtracted from each feature in the data and the obtained values were normalized by dividing each dimension by its standard deviation. Preprocessing data statistics were computed from the subset of training instances. During the training phase random transformations (rotations, zooms, shifts, shears and flips) were applied to data in order to reduce overfitting. The latter assures that our model only rarely sees exactly the same example more than once.
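
A rough sketch of this preprocessing pipeline is below, using PIL and Keras' ImageDataGenerator; the augmentation ranges are illustrative values, not the author's exact settings (his code is in the linked GitHub repository).

```python
# Illustrative preprocessing sketch: resize the shortest side to 256, center-
# crop to 256x256, standardize with training-set statistics, and apply random
# augmentations. Augmentation ranges are guesses, not the author's settings.
import numpy as np
from PIL import Image
from keras.preprocessing.image import ImageDataGenerator

def load_and_crop(path, size=256):
    """Resize the smallest dimension to `size`, then center-crop to size x size."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = size / float(min(w, h))
    img = img.resize((int(round(w * scale)), int(round(h * scale))), Image.BILINEAR)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return np.asarray(img.crop((left, top, left + size, top + size)), dtype=np.float32)

def standardize(x, mean, std):
    """Subtract the mean and divide by the standard deviation
    (statistics computed on a subset of the training images)."""
    return (x - mean) / std

augmenter = ImageDataGenerator(rotation_range=20, zoom_range=0.2,
                               width_shift_range=0.1, height_shift_range=0.1,
                               shear_range=0.2, horizontal_flip=True)
```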

There were two main approaches considered for verifying whether two instances belong to the same class. The unsupervised method involves training a model that can predict one of the 1584 classes and then taking the dot product of the two class distribution vectors (softmax outputs). The supervised method is an end-to-end metric learning approach called a siamese network. The main idea is to replicate the model once for each input image and merge their outputs into a single vector that can then be used to directly predict whether the two images were painted by the same artist. An important aspect of this architecture is that the weights of both models are shared, and during backpropagation the total gradient is the sum of the gradients contributed by the two models. Since the model trained for the unsupervised technique can also be used in the siamese architecture, most of the effort went into the multi-class painter recognition task.
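
The unsupervised pairing step reduces to a one-liner; the sketch below assumes the two softmax vectors have already been produced by the painter-recognition model.

```python
# Minimal sketch of the unsupervised pairing step: score a pair of paintings by
# the dot product of their predicted painter distributions (softmax outputs).
import numpy as np

def same_artist_score(probs_a, probs_b):
    """probs_a, probs_b: 1584-dimensional softmax outputs for the two images."""
    return float(np.dot(probs_a, probs_b))
```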

The depiction below illustrates the architecture of the final convolutional neural network, with non-linearities, dropouts and batch normalization layers omitted. 3x3 convolutional filters with stride 1 are used to produce feature maps that are two neurons smaller than their input volumes along each of the two dimensions. Zero padding is then used to retain the original shape, and 2x2 max pooling with stride 2 halves the number of neurons along each of the two dimensions. Non-linearities are applied to convolution and fully connected outputs using the PReLU function (Leaky ReLU with a trainable slope parameter in the negative part). The dense layers at the end of the architecture are the reason why fixed-size inputs need to be fed to the network. The model is regularized using dropout, batch normalization layers and L2 weight penalties.

Final ConvNet architecture.

300 epochs were needed for the model to converge to a local minimum using the Adam optimizer with a 7.4e-05 learning rate and a batch size of 96 examples. The cross-entropy loss was minimized during training.
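
Combining the architecture and training details above, a hedged Keras sketch might look like the following. This is not the author's exact code (his configuration is in the linked GitHub repository); the number of blocks, filter counts and regularization strengths are guesses, and padding='same' stands in for the "valid convolution, then zero-pad back to the input shape" step described above.

```python
# Illustrative Keras sketch of the kind of ConvNet described above.
from keras.models import Sequential
from keras.layers import (Conv2D, MaxPooling2D, Flatten, Dense, Dropout,
                          BatchNormalization, PReLU)
from keras.regularizers import l2
from keras.optimizers import Adam

def conv_block(model, filters, **kwargs):
    model.add(Conv2D(filters, (3, 3), strides=1, padding='same',
                     kernel_regularizer=l2(1e-4), **kwargs))
    model.add(BatchNormalization())
    model.add(PReLU())
    model.add(MaxPooling2D((2, 2), strides=2))   # halves both spatial dimensions

model = Sequential()
conv_block(model, 32, input_shape=(256, 256, 3))
for filters in (64, 128, 256, 512):              # illustrative depth
    conv_block(model, filters)
model.add(Flatten())
model.add(Dense(2048, kernel_regularizer=l2(1e-4), name="embedding_2048"))
model.add(PReLU())
model.add(Dropout(0.5))
model.add(Dense(1584, activation='softmax'))     # one output per painter

model.compile(optimizer=Adam(lr=7.4e-5), loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, batch_size=96, epochs=300,
#           validation_data=(x_val, y_val))
```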

Neural networks can be used as descriptor generators that produce lower-dimensional representations of input instances. One can think of them as automatic feature extractors. Such embeddings are obtained by simply taking the 2048-dimensional output vectors of the penultimate layer. To check whether there is any internal structure in the features produced by the ConvNet, I used the t-SNE dimensionality reduction technique. t-SNE is a convenient algorithm for visualization of high-dimensional data and allows us to compare how similar input instances are. Below are two scatter plots of some of the artwork images of randomly selected artists from the validation set. Bearing in mind that the network hadn't seen those examples during training and that the t-SNE algorithm doesn't get class labels as inputs, the visual results are quite exciting.

t-SNE embeddings of the features generated by the ConvNet (click on the image for full resolution).
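
Continuing from a trained model like the sketch above, the 2048-dimensional embeddings can be pulled from the penultimate dense layer and projected with scikit-learn's t-SNE; variable names such as trained_model and x_val are placeholders.

```python
# Sketch: extract penultimate-layer embeddings and project them with t-SNE.
# `trained_model` and `x_val` are placeholders (a fitted Keras model and the
# preprocessed validation images); the layer name matches the sketch above.
from keras.models import Model
from sklearn.manifold import TSNE

embedder = Model(inputs=trained_model.input,
                 outputs=trained_model.get_layer("embedding_2048").output)
embeddings = embedder.predict(x_val, batch_size=96)          # (n_images, 2048)
coords = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
# `coords` can then be scatter-plotted and coloured by painter.
```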

The public leaderboard score was calculated on 70% of the submission pairs and the private leaderboard score on the remaining 30%. The final submission was generated using the unsupervised approach for verifying same-class identity. The best single ConvNet scored 0.90717 AUC on the private leaderboard, and an ensemble of the 18 best ConvNets trained during the hyperparameter search scored 0.92890 AUC on the private leaderboard. Adding more (worse) models to the ensemble started to hurt the overall performance. A single hypothesis was obtained from the multiple models as a weighted average of their predictions for the painter recognition task, and only then was the inner product of the two averaged class distribution vectors calculated.

Were you surprised by any of the findings?

The administrator of the competition, Kiri Nichol, posted some very useful insights into the performance of the algorithm on the private test dataset. As stated on the competition forum, the ingenious Dutch forger Han van Meegeren was slipped into the test set in order to better understand how good the model is at extracting painters' unique styles. The forger replicated some of the world's most famous artists' work, including the paintings of Johannes Vermeer. Below is a pairwise comparison table of my best submission's predictions for the van Meegeren and Vermeer examples from the test set. Based on the model's predictions, it can be seen that Vermeer's paintings are indeed more similar to each other than van Meegeren's paintings are to Vermeer's paintings. It can also be seen that Vermeer's paintings are more similar to each other than van Meegeren's paintings are to each other, due to van Meegeren forging paintings in the style of several different artists.

Pairwise comparison table for van Meegeren and Vermeer paintings from the test set.

Another really valuable insight concerns the extrapolation of the model to artists that were not seen during training. The results are given as the AUC of my final submission for two different groups of instances from the test set: the first group consists of pairs of images whose painters were present in the training set (0.94218 AUC), and the second is composed of pairs whose artists hadn't been seen by the model before (0.82509 AUC).

Based on the results of the competition, it can be concluded that convolutional neural networks are able to decompose the visual space of artwork images based on their painters' unique styles. The bad news is that the described algorithm is not good at extrapolating to unfamiliar artists. This is largely due to the fact that same-identity verification is calculated directly from the two class distribution vectors.

Which tools did you use?

All of the code was written in Python and the most important libraries that were used are Keras (with Theano backend), NumPy and scikit-learn.

What was the run time for both training and prediction of your winning solution?

Training of the final ConvNet took a bit more than 4 days on a single GeForce GTX TITAN X GPU, prediction of the artists for 23817 test images took around 15 minutes and the time needed for calculating the inner products for 22M submission pairs was negligible compared to the training times.

Words of wisdom

What have you taken away from this competition?

As my first Kaggle competition, this was an excellent learning experience, and since I'm planning to continue the work as my upcoming master's degree thesis, it was also a great opportunity to gain more knowledge about possible pitfalls and challenges in the domain. From this point forward my main focus will be on achieving better generalization by training the end-to-end metric learning technique called a siamese network that was only briefly mentioned above.

At this point I would also like to thank Niko Colnerič, Tomaž Hočevar, Blaž Zupan, Jure Žbontar and other members of the Bioinformatics Laboratory from University of Ljubljana for their help and provision of the infrastructure.

Do you have any advice for those just getting started in data science?

I think that in order to really understand how something works, one has to implement it. This is especially important at the beginning, since one has no related knowledge to associate new reasoning with. So start by implementing simple algorithms and use those to create submissions for Getting Started competitions.

Just for fun

What is your dream job?

Working as a data scientist in a diverse domain and with people from whom I can learn a lot.

Bio

Nejc Ilenič is currently an MSc student in computer science at University of Ljubljana, Slovenia. After graduation he aspires to pursue a career as a data scientist.

Integer Sequence Learning Competition: Solution Write-up, Team 1.618 | Gareth Jones & Laurent Borderie

The Integer Sequence Learning playground competition, which ran on Kaggle from June to October 2016, was a unique challenge for its 300+ participants. Given sequences of integers sourced from the Online Encyclopedia of Integer Sequences, the goal was to predict the final number of each of the hundreds of thousands of sequences.

In this interview, Gareth Jones and Laurent Borderie (AKA WhizWilde) of Team 1.618 describe their approach (or rather, approaches) to solving many "small" data problems, how they would have used unsupervised methods to cluster sequence types were they to start over, and their advice to newcomers to machine learning.

The basics

What was your background prior to entering this challenge?

Gareth: I’m a post-doc at the UCL Ear Institute, UK working on multisensory decision making and perception. I started as an experimental neuroscientist but have progressively become more computational.

Laurent: I work as a biological data analyst, switching from computational biology to bioinformatics and machine learning as needed. I am a pharmacist specialized in research, with Master's degrees in Neurobiology/Pharmacology and in Project Management and Therapeutic Innovation in Biotechnology, but I became a computer scientist through studentships, past jobs, many MOOCs and a personal interest in coding, machine learning, maths and AI.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

Gareth: Not specifically with integer sequences, but I have experience handling and analysing large datasets, and with machine learning in general.

Laurent: I love maths and am always willing to discover more about advanced topics. Plus, analyzing data is like an occupational hazard for me and I have been involved in a few Kaggle competitions.

How did you get started competing on Kaggle?

Laurent: Through No Free Hunch. I wanted to test myself and I got hooked.

Gareth: I have completed a number of Coursera and edX courses over the past few years on machine learning, neuroscience, and finance. Kaggle was the next logical step in trying out the things I’ve learned.

What made you decide to enter this competition?

Laurent: As I said, I love maths. Plus, the problem has scientific interest, and it was a change from the usual "data table format" I had mainly worked with.

Gareth: It’s an unusual and tricky problem from a machine learning perspective, one that really highlights the importance of always remaining skeptical of models that superficially fit well.

Let’s get technical

What preprocessing and supervised learning methods did you use?

Gareth: Very little, and really only linear regression. No major preprocessing was necessary for the approaches I tried, and, I think, this competition could have required no supervised learning at all. Ultimately, most of my solutions were linear fits to a rolling window of previous points (Figure 1) (~37,000 of the 115,000 sequences); however, only around 15% of these were actually correct. The valuable solutions came from specific solvers that employed methods such as common differences (~15,000 sequences), pattern search (~2,000), recurrence relations (~10,000), etc. The solvers I implemented are described in more detail here.

For sequences that were predicted using regression, the training data was generated from the sequence itself rather than from the separate training set of sequences provided (Figure 1). A rolling window of a set number of previous terms of the sequence was applied, with the next term being the target. The last known term in the sequence was held out, and as many windows as possible were generated from the remaining terms. The model was trained on this data and performance was assessed as the % difference between the next prediction and the held-out term.
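
A minimal sketch of such a rolling-window linear fit is below; the window size and acceptance tolerance are illustrative choices, not Gareth's exact settings.

```python
# Minimal sketch of the rolling-window linear fit described above, using
# numpy least squares. Window size and tolerance are illustrative choices.
import numpy as np

def predict_next(seq, window=5, tol=1e-9):
    """Fit next term ~ linear function of the previous `window` terms,
    validating on the held-out last known term."""
    seq = np.asarray(seq, dtype=float)
    if len(seq) < window + 2:
        return None
    # Build (window -> next term) training pairs, holding out the last term.
    X = np.array([seq[i:i + window] for i in range(len(seq) - window - 1)])
    y = seq[window:-1]
    X = np.hstack([X, np.ones((len(X), 1))])          # intercept column
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Validate: predict the held-out last known term.
    held_out_pred = np.append(seq[-window - 1:-1], 1.0) @ coef
    if abs(held_out_pred - seq[-1]) > tol * max(1.0, abs(seq[-1])):
        return None                                    # fit not trusted
    # Predict the next (unknown) term from the final window.
    return int(round(np.append(seq[-window:], 1.0) @ coef))

print(predict_next([1, 2, 3, 4, 5, 6, 7, 8]))  # -> 9
```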

I also tried fitting SVMs using the same method to try and avoid imposing a parametric function on each sequence, but there wasn’t enough data available in the known terms of each sequence to train these successfully.

Figure 1 - Generation of training and validation data for sequences fit with linear polynomials. Training data used a set number of previous points with each term in the sequence being the predictive target. The final known term was used to validate accuracy of the model.

Laurent: My preprocessing was only meant for two things: either to facilitate sequence handling or to allow specific operations like scoring predictions through similarities or differences between tested and model sequences. Tagging also allowed using specified fallback methods in case no prediction could be made (either no prediction was available or, depending on the setting, the score fell under a threshold). I used a little linear regression for some predictions. My approaches are detailed here.

I also laid the foundations of an ML approach to learn from both general and positional tagging, but had no time to pursue it. I wanted to learn from each sequence as its own training set, from classified/tagged parts of the set, or from the whole training set, with both supervised and unsupervised methods. Tools I was interested in using were LSTMs and genetic algorithms, in addition to more regular forests, neural nets, or SVMs.

What was your most important insight into the data?

Laurent: I had no special “revelation” about the data, though I had various ideas about how to approach the problem, most of which proved relevant. I used various signature methods; improvements to the model sequence set by reversing, shifting and merging sequences, and by adding commonly known sequences; sequence tagging (I came up with my own and added Gareth's to it); and specific scoring approaches. I knew that technical aspects (handling large numbers, efficient and flexible structures) would be crucial, and that sometimes mere statistics could be better than stretched predictions because of an inference problem: a “right” prediction for a given sequence could still be wrong for that sequence number, which addresses a specific problem in the OEIS that your methods are not capturing correctly.

Gareth: Two things:

  • That sequences can be grouped by their fundamental properties to inform solver selection, which I unfortunately never got around to integrating with my methods, but which I think could lead to a more intelligent approach to the problem.
  • This problem isn't one big data problem; rather, it's many small data problems. Each sequence is independent of all the other sequences, meaning that training a model needs to use the intra-sequence data only, which is very limited. Another difficulty comes with scaling these small solutions to hundreds of thousands of sequences without simply generating a harmful number of false-positive solutions. Fitting complex functions with very limited validation risked creating false positives at a harmful rate. False positives prevent fallback to the sequence mode (the last-resort solution), and despite the mode being generally very poorly predictive of the next sequence term, it provided more value than trying too many fits.

Were you surprised by any of your findings?

Gareth: I’m a bit surprised how low the scores are overall, I don’t think we’ve totally cracked this problem yet.

It was possible to cheat in this competition by simply looking up answers on the OEIS, so I’m hoping others with (realistically) high scoring submissions will share their solutions – I’d like to know what else worked!

Laurent: After the competition, I submitted the results of a script that I could not merge in time with our results because of technical problems. It was quite poor alone (0.0722), but it actually boosted our results by 0.00586, from 0.21939 to 0.22515, merely by substituting its values for the equal-to-zero predictions in our final submission, meaning it was an entirely new set of mostly good predictions (despite losing some fraction of points by changing some solutions to wrong predictions). I am still running scripts with new twists to check and try to improve this.

Which tools did you use?

Gareth: R and RStudio.

Laurent: Python/Anaconda suite, plus a few unusual packages like Sympy, which allowed me to work with primes and some types of sequences.

How did you spend your time on this competition?

Gareth: Most of my time was spent trying to find and implement general methods to solve sequences. The rest was spent trying to figure out which solvers (including ML techniques) were most reliable and in what order to apply them to limit false-positive solutions.

Laurent: I spent my time reading about the problem and advanced math topics; coding and optimizing processing and algorithms to handle the enormous load that some methods could represent; finding methods and testing which would have the best return; and optimizing scoring to improve the final predictions.

What was the run time for both training and prediction of your winning solution?

Laurent: From maybe 2 or 3 hours with reasonable settings to forever and a day when trying “brute force” approaches.

Gareth: Without any fitting, only around an hour. With linear regression, maybe 12 hours. With attempts to fit non-parametric or non-linear models (which never actually helped), considerably longer!

Words of wisdom

What have you taken away from this competition?

Gareth: That if I were to start again, I’d totally restructure my assumptions to frame the problem differently.

Everything I did was based on an early assumption that the reliability of a solver would be the same for all types of sequences, meaning solver priority would be constant across sequence types. However, this is a gross oversimplification; consider, for example, mode-fallback (simply taking the mode of the known terms to predict the next). On average, it’s a very poor solver, but for a binary sequence where 99% of the terms are 0, it’s almost always correct. Conversely, there’s far less chance the mode will be correct for monotonic sequences.

A better approach might be to tag the basic properties of sequences, use some unsupervised learning to cluster types of sequences (Figure 2) and then to test solver reliability on each group. It would then be possible to dynamically apply solver priority based on these basic properties of a sequence. Perhaps it would even be possible to use the groups as training data, making this problem more accessible to more supervised learning techniques.
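
As a rough illustration of that idea, one could tag each sequence with simple boolean properties and cluster the tag vectors. The tags, the toy sequences and the use of k-means below are all illustrative choices, not the authors' actual pipeline (Figure 2 below shows a clustering of tags from real sampled sequences).

```python
# Rough sketch of the clustering idea above: tag each sequence with simple
# boolean properties, then cluster the tag vectors. Tags, data and the choice
# of k-means are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def tag_sequence(seq):
    diffs = np.diff(seq)
    return [
        all(d >= 0 for d in diffs),             # non-decreasing
        all(d == diffs[0] for d in diffs),      # constant difference (arithmetic)
        len(set(seq)) <= 2,                     # (near-)binary
        all(x >= 0 for x in seq),               # non-negative
    ]

sequences = [[1, 2, 3, 4], [0, 1, 0, 0, 1], [2, 4, 8, 16], [5, 3, 8, 1]]
tags = np.array([tag_sequence(s) for s in sequences], dtype=float)
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tags)
print(groups)  # cluster label per sequence; solver priority could be set per cluster
```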

Figure 2 - Clustering of the tags given to 30 randomly sampled sequences, with examples from high level groups. Sequences from different groups are likely to have different classes of generating functions.

Laurent: Like Gareth, I would maybe do things differently.

  • I wanted to test a lot of things, but could only work on this during a period of 5 weeks. I focused on what appeared the most efficient (solvers based on similarities and methods relying on algebra) and mostly integrated scripts that could run without my attention. Instead, I would sometimes have tested individual approaches on the whole set in dedicated scripts and used their results to inform my next steps more accurately. Here that was more difficult to do, as I knew I could only take one shot at a main method with few flaws, so I wanted to give it the most power by allowing it to go to the limits.
  • I came up with nice scoring, but I also would have liked to better mix such methods with statistical and machine learning approaches, with tagging improved by ML methods in addition to tagging through algebra tests, and with an ML system to optimize the scoring settings.
  • I would test more variants of my twists, too, like more complex sequence merging.
  • I would also anticipate scaling problems by coming up with better methods to split processing into chunks, as methods that proved fast and efficient with small and medium numbers of sequences unexpectedly proved absolutely impractical in the end.

Do you have any advice for those just getting started in data science?

Gareth: Always try to apply the things you learn; MOOCs for theory, Kaggle for practice. Also, try to get experience collecting data as well as analysing it. Experimental science is difficult and messy in ways that aren't obvious simply by looking at a dataset. Understanding how and why will give you a healthy skepticism about the quality of all data.

Laurent:

  • Don't be shy, fearful, or discouraged by your lack of formal (or informal) knowledge. ML can be impressive and challenging as you discover it, but to become familiar with its fundamental concepts you just need to toy with it, to play with algorithms and see the results, even before knowing the deep concepts behind them. And it's even easier today with suites and programming languages made just for this. In time you can learn what you need to tackle real theoretical walls.
  • When you learn, don't lose yourself in books or MOOCs: choose one adapted to your path and go from start to finish, then toy again, practice, and perform in competitions or on real-life problems.
  • Enjoy yourself. Think outside the box. Redefine the box. Read and talk to people on the forums.
  • Take your time. I know there are Kagglers who took only a few months to become masters from scratch, but no need to rush. You just want to be skilled in the end, and there is no shortcut to your own path.

Teamwork

How did your team form?

Gareth: Laurent and I discussed a lot of interesting ideas on the forums throughout the competition and decided to team up to combine our approaches.

Laurent: I appreciated Gareth's clever insights, and I felt it was very easy for us to understand each other, so teaming up felt natural, as it would allow us to share things more freely and efficiently. We just had to find a clever name and we were done.

How did your team work together?

Gareth: We teamed up fairly late in the competition and were working on different approaches in different scripting languages. Rather than spending time converting code, we generated submissions and combined them based on how reliable we estimated the available methods to be for solving each sequence.

Laurent: We shared views on options and result selection; he also ran some of my scripts that my computer couldn't handle, as they felt promising. Had we worked together from the beginning, I guess it would have been a little different.

How did competing on a team help you succeed?

Gareth: Laurent implemented a number of difficult and valuable methods that I would never have had time to alone. Our discussions throughout the competition, even before teaming up, stimulated a lot of ideas.

Laurent: Well, Gareth already had good results by himself; I merely added a few things on top of that in the end. Some post-competition submissions showed that I could have contributed even more to the final results. Sharing in a team is more fun and it allowed us to find better ideas. We could also share hardware resources, which proved handy.

Just for fun

What is your dream job?

Laurent: The same as now, with more people from various fields around me, with people to admire and others to coach, and more projects in ML.

Gareth: Something involving neuroscience and machine learning - either using machine learning to guide health decisions (like in the EEG seizure competition, which is currently taking up my spare time), or, conversely, using neuroscience to inform the development of machine learning algorithms and decision making in AI.

Bios

Gareth Jones has a PhD in Neuroscience from The University of Sussex, UK and is currently a post-doc at the UCL Ear Institute, UK. His research uses electrophysiology, psychophysics, and computational modelling to investigate the neural mechanisms of sensory accumulation, multisensory information combination, and decision making.

Laurent Borderie works as a biological data analyst. He is a PharmD with Master's degrees in Neurobiology and in Biotechnology Therapeutic Innovation. He learned ML outside his academic curriculum, and his research interests include artificial intelligence, sentiment and behavioral analysis, brain-computer interfaces, and heuristics for interpreting biological analyses.
