Bosch Production Line Performance Competition Third Place Winners' Interview

The Bosch Production Line Performance competition ran on Kaggle from August to November 2016. Well over one thousand teams with 1602 players competed to reduce manufacturing failures using intricate data collected at every step along Bosch's assembly lines. Team Data Property Avengers, made up of Kaggle heavyweights Darragh Hanley (Darragh), Marios Michailidis (KazAnova), Mathias Müller (Faron), and Stanislav Semenov, came in third place by relying on their experience working with grouped time-series data in previous competitions plus a whole lot of feature engineering.

The banner image is from gingerman's kernel, Shopfloor Visualization 2.0.

The Basics

What was your background prior to entering this challenge?

Darragh Hanley: I am a part time OMSCS student at Georgia Tech (focus area Machine Learning) and a data scientist at Optum, using AI to improve people’s health and healthcare.

Darragh on Kaggle.

Marios Michailidis: I am a Part-Time PhD student at UCL, data science manager at Dunnhumby and fervent Kaggler.

Marios Michailidis on Kaggle.

Mathias Müller: I have a Master's in computer science (focus areas cognitive robotics and AI) and work as a machine learning engineer at FSD.

Mathias Müller (AKA Faron) on Kaggle.

Stanislav Semenov: I hold a Master's degree in Computer Science. I've worked as a data science consultant, teacher of machine learning classes, and quantitative researcher.

Stanislav Semenov on Kaggle.

How did you get started with Kaggle?

Darragh Hanley: I saw Kaggle as a good way to practice real world ML problems.

Marios Michailidis: I wanted a new challenge and learn from the best.

Mathias Müller: Kaggle was the best hit for “ML online competitions”.

Stanislav Semenov: I wanted to apply my knowledge in practical problems.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

We don’t have any such domain knowledge, but already have big experience with grouped time-series data (like in RedHat or Telstra Competition) to generate right features.

Let's Get Technical

Features

There was a lot of feature engineering.

Early in the competition it became clear there was a leak in the data set which only a few teams had found. Soon after, Mathias found the leak and released the magic features in a public kernel. The leak involved sequential components with the same numerical readings having a high rate of failure. The public release of these features opened the competition and remained our strongest feature throughout.

We also found extra information by using different criteria for what are sequential components. With so many different production lines and stations, two components could be sequential at one station, but then pass through different stations for the next phase of production. Given this we identified sequential components, date-wise, within each individual station with the same numerical readings. We found some stations, such as L3_S29 and L3_S30, worked particularly well as most components passed through these stations. This can be seen particularly well in John M's visualization of the manufacturing process. After identifying this, we could build on it by counting how many stations a pair of components had the same values in, or by counting the number of times those numerical reading occurred over the whole data set.

John M's visualization< of the manufacturing process

John M's kernel visualizing the manufacturing process.

We also saw varying trends in failure rates over time, both in the short and long term horizon. We trained models using rolling mean of the component failures sorted based on start and end dates of all stations. We calculated using different rolling windows – 5, 10, 20, 100, 1000, 5000 components – to catch both the long term and short term trends in failure rates. It was important to calculate such features out of fold to prevent overfitting. Below can be seen OOF rolling mean compared to usual rolling mean with a window size of 5 components.

OOF rolling mean compared to usual rolling mean with a window size of 5 components

Similarly we could capture the lag and lead of the target out of fold, something which tree based models would not capture well out of the box.

Besides this, we had a lot of usual features:

Encoded a few categorical columns with out of fold Bayesian Mean
Counts of non-duplicated categorical columns, counts of non-duplicated date and numeric columns.
Encoding the paths of components through the stations; whether components passed through the same sequence of stations.
Row wise NA counts for numeric, as well as max/min per stations.

Overall, we have more than 2000 features at 1st level models.

Validation

We used a 5-fold cross-validation. Unfortunately, our validation improvements have not always coincided with improvements on the leaderboard because of the discrete metric, Matthews correlation coefficient.

Models

There were about 160 models on the 1st level. Most of them are XGBoosts, but in this competition LightGBM has proved to be very good. We also have Extra Trees Classifiers, Neural Nets, Random Forests and Linear Models.

Meta-modeling

Meta modelling was really simple in this competition (compared to other competitions). We just used one bagged XGBoost on the 2nd level. Model selection for meta was done based on the same 5-fold cross-validation that we used at the base level. After model selection we had a total of 45 models. Any other stuff did not work.

Schematic of meta-modeling used.

Final Ensembling

In this competition we should predict strictly 0 or 1 – so in choosing a probability threshold, rows close to the threshold showed a lot of randomness – sometimes being predicted as 0, sometimes as 1. To mitigate against this, we used a majority vote of a number of different discrete predictions as our final selection. The result gave a nice boost - around 0.003 on public and private LB.

Below you can see a histogram (cut off on the y-axis) of the row wise sum of the 25 subs with some information on how many rows were always predicted as 1 or the other class or sometimes both classes across different submissions:

Histogram (cut off on the y-axis) of the row wise sum of the 25 subs.

Bios

Darragh Hanley (Darragh) is a Data Scientist at Optum, using AI to improve people’s health and healthcare. He has a special interest in predictive analytics; inferring and predicting human behavior. His Bachelor’s is in Engineering and Mathematics from Trinity College, Dublin, and he is currently studying for a Masters in Computer Science at Georgia Tech (OMSCS).

Marios Michailidis (KazAnova) is Manager of Data Science at Dunnhumby and part-time PhD in machine learning at University College London (UCL) with a focus on improving recommender systems. He has worked in both marketing and credit sectors in the UK Market and has led many analytics projects with various themes including: Acquisition, Retention, Uplift, fraud detection, portfolio optimization and more. In his spare time he has created KazAnova, a GUI for credit scoring 100% made in Java. He is former #1 Kaggler.

Mathias Müller (Faron) is a machine learning engineer for FSD Fahrzeugsystemdaten. He has a Master's in Computer Science from the Humboldt University of Berlin. His thesis was about 'Bio-Inspired Visual Navigation of Flying Robots'.

Stanislav Semenov (Stanislav Semenov) is a Data Scientist and Quantitative Researcher. He has extensive experience in solving practical problems of data analysis and machine learning, predictive modelling. Co-founder of Moscow ML Training Club, World Champion of Data Science Game (2016). At the current time, he holds 1st rank on Kaggle.

Bosch Production Line Performance Symposium Winners

Bosch's competition, which ran from August to November 2016, challenged Kagglers to predict rare manufacturing failures in order to improve production line performance. While the challenge was ongoing, participants had the opportunity to submit research papers based on the competition to the Symposium for Advanced Manufacturing at the 2016 IEEE International Conference on Big Data.

Based on peer review by experts in the field, three teams were chosen to receive $2,000 travel grants to present their work at the symposium in December in Washington D.C. In this blog post we congratulate Ankita Mangal and Nishant Kumar, Abhinav Maurya, and Bohdan Pavlyshenko for their awards and they share their approaches to the competition plus the research they presented at the symposium.

section-divider

Ankita Mangal & Nishant Kumar

What was your background prior to entering this challenge?

Nishant: I am currently a Data Scientist at Uber and also a Kaggle Master. I am a Mechanical Engineer turned Data Scientist and I have enjoyed working in the field of Machine Learning. I started with Kaggle 4 years ago and it has helped me a lot in improving my Machine Learning skills.There are a lot of techniques from Kaggle competitions which I use in my professional life at Uber. Uber is a great place to work, where we solve challenging problems like any other kaggle competition. We are currently hiring passionate Data Scientists/Engineers. My LinkedIn profile can be found here.

Nishant Kumar on Kaggle.

Ankita: I am a doctoral candidate in Materials Science & Engineering at Carnegie Mellon.
I am particularly interested in bringing together the field of data science and ICME(Integrated Computational Materials Engineering). I identify useful concepts from the field of machine learning, network analysis, structure mining and apply them to solving material science problems. I got interested in this challenge as it gave me a platform to apply skills from my current forte and prior experience in quality assurance management at Tata Steel Ltd. My LinkedIn profile can be found here.

Ankita Mangal on Kaggle.

Can you describe your approach in the Bosch competition?

The first thing we noticed was the dataset size (~14.3 Gb) and the large number of features (4265) of categorical (2140), numerical (968) and timestamp (1157) kind. The categorical features consisted of both single and multi-class labels and hence utilizing them via one-hot encoding presented it’s own problems because of the increased dimensionality of the feature space. Hence, we decided to use an online learning model with feature hashing implemented via Follow the Regularized Leader (FTRL) algorithm.

We divided the training dataset into two parts, and trained an online learning model on each part using categorical features only. Next, we used the model trained on one fold, to predict on the other fold, and constructed a probability column to be used as a feature in the next steps. This way, we have captured the information from the 2140 categorical features in just one numerical feature.

Then, we stacked this probability output from categorical features along with the remaining numerical and timestamp features, and used the Extreme Gradient boosting classifier to find the top 200 most important features. A final XGBoost model was then trained on these 200 features to get the final predictions.

Tell us about the paper you presented at the Symposium for Advanced Manufacturing

The Symposium gave us an unique opportunity to share our approach with other Kagglers, as well as a broader audience interested in the field of smart manufacturing ranging from company representatives to full time researchers. To cater to this audience, we explored the anonymized dataset to find insights about the assembly line and presented a picture of what’s happening at the shop floor. Every assembly line will have certain production flows, and to gain insights into this, we used the information contained in the feature names. We found out that the assembly line consists of 51 stations distributed between 4 production lines. Each station has different number of parts passing through it, which could mean the existence of different classes of products.

By comparing the number of measurements taken, defective product rate and the number of products passing through each station, we came to the conclusion that one of the stations ( number 32) is probably a re-processing or post processing station because it has the highest error rate, very few products pass through it, and there is only one kind of measurement taken there. (As illustrated in the below figure, the Error Rate/Fraction for Station 32 is the highest)

The timestamp features were also anonymized, and we calculated the autocorrelation between the number of products measured at each time unit as a function of time lag between them to understand the anonymized time units. We found out that the dataset consisted of measurements taken over 102.5 weeks and the measurements were recorded at a granularity of 6 minutes. Thus we can infer some structure about the timestamps from the anonymized features.

Using the machine learning techniques described above, we could build a model with an AUC of 0.716 and Matthews correlation coefficient of 0.23. This meant that this model could be used to tag products likely to fail resulting in a smarter failure detection problem, which is much better than checking for defects at random. With this model, out of the 1 million test samples, only 3000 are tagged likely to fail, which results in saving time and resources due to reduced product downgrading, increased salvage and higher production yields.

The final model showed that the most important features influencing product failure are:

categorical probability
time spent by a product in the production line
if products belonged to the same batch (same timeline), and
products passing through production line 3 (possibly because station 32 belongs to that).

The final model which we submitted ranked top 10% in the private leaderboard and included leakage (magic) features. These features use the fact that sequential components with same numerical/date readings will belong to a batch and hence have similar rate of failure. But this information is not available in real-time processes and hence we did not use them in this paper. Hence, the model mentioned in the paper can be applied at Bosch’s to reduce their failure rates. For more details, please refer to the paper here.

section-divider

Abhinav Maurya

What was your background prior to entering this challenge?

I am a PhD student in Information Systems at Carnegie Mellon University. My primary research interests are machine learning, data science, Bayesian statistics, and deep learning. I like designing and developing machine learning methods that scale to massive datasets, are easily interpretable by users, and help bridge the prediction-decision gap in machine learning by providing actionable insights into the data. My past research projects include a diverse set of socially relevant problems tackled through the lens of Bayesian statistics. My LinkedIn profile can be found here.

Abhinav Maurya on Kaggle.

Can you describe your approach in the Bosch competition?

In the Kaggle challenge, our goal was to detect if a manufactured part suffers from internal defects, based on sensor measurements from the assembly lines during the manufacturing process. There are two possible approaches to tackle this problem: Anomaly Detection and Binary Classification. We adopted the second approach in order to utilize anomaly supervision since the dataset contained datapoints that were specifically marked as anomalous.

Since internal defect rates using modern manufacturing processes are low due to excellent statistical quality control, the resulting dataset is often highly imbalanced with very few anomalous, positive datapoints. In order to deal with the severe imbalance in the number of positive and negative datapoints, we chose to learn a weight parameter “w” that trades off between the losses incurred on the positive and negative datapoints. Specific to the Bosch challenge, our approach was to design a Gaussian Process-based meta-optimization algorithm that directly optimized the required metric of Matthew’s Correlation Coefficient (MCC) using Gradient Boosting Machine (GBM) as a base classifier. The following figure provides a schematic overview of our system:

Tell us about the paper you presented at the Symposium for Advanced Manufacturing

Predicting internal failures in manufactured products is a challenging machine learning task due to two reasons: (i) the rarity of such failures in modern manufacturing processes, and (ii) the failure of traditional machine learning algorithms to optimize non-convex metrics such as Matthew’s Correlation Coefficient (MCC) used to measure performance on imbalanced datasets. In our paper, we presented “ImbalancedBayesOpt”, a meta-optimization algorithm that directly maximized MCC by learnings the optimal weights on the losses incurred on the positive and negative datapoints.

ImbalancedBayesOpt Visualized.

We used Gradient Boosting Machine (GBM) as the base classifier for our meta-optimization algorithm due to its competitive performance on machine learning prediction tasks. Using “ImbalancedBayesOpt”, we could significantly improve the classification performance of the base classifier on the severely imbalanced high-dimensional Bosch dataset for detecting rare internal manufacturing defects. Our presentation from the IEEE BigData 2016 conference can be found here.

section-divider

Bohdan Pavlyshenko

What was your background prior to entering this challenge?

I work as a Data Scientist at SoftServe (Ukraine) and I am an associate professor (Ph.D.) at electronics and computer technologies faculty of Ivan Franko National University of Lviv (Ukraine). I have Master level at Kaggle. Our team ”The Slippery Appraisals” won the Grupo Bimbo Inventory Demand Kaggle competition. My current scientific areas are: Data Mining, Predictive Analytics, Supply Chain analysis, Machine Learning, Information Retrieval, Text Mining, Natural Language Processing, R Analytics, Social Network Analysis, Big Data; semantic field approach in the analysis of semi-structured data.

Bohdan Pavlyshenko on Kaggle.

Can you describe your approach in the Bosch competition?

The main idea of our study in the Bosch Data Challenge was to show different approaches applying logistic regression to the problem of manufacturing failures detection. We considered the use of machine learning, linear and Bayesian models. The machine learning approach can give us the best-scored failure detection. The generalized linear model for the logistic regression makes it possible to investigate influence factors on the failure detection in the groups of manufacturing parts. Using Bayesian model, it is possible to receive the statistical distribution of model parameters, which can be used in the risk assessment analysis. Using 2-level models, we can receive more precise results. Using Bayesian model on the second level with the covariates that are the probabilities predicted by machine learning models on the first level, makes it possible to take into account the differences in results for machine learning models received for different sets of parameters and subsets of samples in case of highly imbalanced classes.

_blog_bosch_symposium_models

In our work, we did not invest a lot of time into features construction and selection due to the fact that features were anonymized, so we cannot apply different models of features interaction which are based on the domain of data. So, high scores of logistic regression was not our goal. As it is known, during competition so-called magic features were found, based on the samples ID which improve scoring essentially.

URL:https://www.kaggle.com/mmueller/bosch-production-line-performance/road-2-0-4, The Magic Feature : from LB 0.3-to 0.4+.
URL: https://www.kaggle.com/c/bosch-production-line-performance/forums/t/24065/themagical-feature-from-lb-0-3-to-0-4). In our presentation, we made calculations for 2 sets of features – set 1 contained the most important features and set 2 was the set 1 with added 4 magic features.

ROC curve for classification results for different sets of features (Set 2 is Set 1 with added magic features).

Matthews correlation coefficient for different sets of features.

Tell us about the paper you presented at the Symposium for Advanced Manufacturing

On the Bosch Production Line Performance Kaggle competition, Bosch invited participants to apply for one of three travel stipends to attend the conference and present a research from their work in this competition (https://www.kaggle.com/c/bosch-production-line-performance/details/ieee-bigdata-2016 ). I submitted my results of my scientific studies and according to review scores my paper was chosen for the symposium presentation and I won the travel grant to attend it (https://www.kaggle.com/c/bosch-production-line-performance/forums/t/25032/symposium-winners).

This symposium intended to provide a platform for researchers and industry practitioners from manufacturing, information science, and data science disciplines to share their data mining and big-data-analytics-related research results, and practical design or development experiences in the manufacturing industry.

At this symposium, the keynote speaker Dr. Rumi Ghosh had a very interesting speech “From Sensors to Sensing- Industrial Data Mining at Bosch. She informed the attendees about the Bosch Data Challenge on the Kaggle, and described the received results and problems concerning the analysis of data received from assembly lines during the manufacturing processes at Bosch.

At the symposium, I gave my talk “Machine Learning, Linear and Bayesian Models for Logistic Regression in Failure Detection Problems”. The main results from my speech are described in my article here. The conference and symposium were very interesting for me, there were many interesting talks, presentations and discussions. Special thanks to Bosch for organizing such an interesting Kaggle competition, Bosch Production Line Performance, and for awarding me the travel grant for attending the IEEE BigData 2016 conference!

The Seizure Prediction competition—hosted by Melbourne University AES, MathWorks, and NIH—challenged Kagglers to accurately forecast the occurrence of seizures using intracranial EEG recordings. Nearly 500 teams competed to distinguish between ten minute long data clips covering an hour prior to a seizure, and ten minute clips of interictal activity. In this interview, Kaggler Gareth Jones explains how he applied his background in neuroscience for the opportunity to make a positive impact on the lives of people affected by epilepsy. He discusses his approach to feature engineering with the raw data, the challenge of local cross-validation, plus his surprise at the effectiveness of training a single general model as opposed to patient-specific ones.

The basics

What was your background prior to entering this challenge?

I have a PhD in neuroscience and currently work as a post doc at the Ear Institute, University College London, UK. My work is in sensory processing, multisensory integration and decision making.

Gareth Jones on Kaggle.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

I have experience collecting and analysing electrophysiological data, but not with seizure prediction from EEG data specifically.

What made you decide to enter this competition?

My background is in experimental and computational neuroscience, and it was exciting to find a topic that combined neuroscience and machine learning, and has the potential to have a direct therapeutic impact on people’s lives. If seizure prediction can be done reliably, particularly without too many false alarms, it may be able to greatly mitigate the danger and inconvenience of seizures to epilepsy suffers.

Let’s get technical:

What preprocessing and supervised learning methods did you use?

Raw preictal (before seizure) and intericatal (normal activity) intracranial EEG data recorded from implants in 3 human patients were provided for this competition (Figure 1). These patients were the lowest scorers (of 15) for prediction accuracy from a previous study using the NeuroVista Seizure Advisory System (Cook et. al, 2013).

Some basic pre-processing had already been done, but no ready-to-use features were included. The lack of existing features means more work initially, but isn’t necessarily a bad thing; having the raw data and being able to extract your own features is incredibly powerful, and allows much greater scope for reasoning in the feature engineering stage. This isn’t always possible when working with pre-prepared datasets, which often contain obscured features that require reverse engineering to get the most out of.

Figure 1 – 16 channel intercranial EEG data from interictal and preictal periods, plotted on a logarithmic x axis to visualise the data at multiple time scales.

The raw data needed a bit of additional pre-processing before features extraction. In the training set the data were split in to 10 minute recordings per file. Some of these 10 minute files were sequential with 5 other files, meaning 60 minutes of consecutive data was available. Other files in the training set (and all of the test set) were isolated 10 minute segments of data.

The sequential files were concatenated first in to 60 minute segments and the other, individual, 10 minute files were left as 10 minute segments. These segments were then subdivided in to discreet epochs of 50-400 s for feature processing (Figure 2), with each epoch and its extracted features representing one row in the data set with its corresponding feature columns. Features extracted from multiple epoch window lengths were joined together before training.

Figure 2 – temporal and frequency domain features were extracted from each individual epoch of raw data. In the frequency domain, these included EEG band powers and the correlation of these across channel. Across-channel correlations were also taken in the time domain, along with basic summary statistics for each channel.

For training, data from all three patients were used to train two general (rather than patient specific) models. I used an ensemble of a quadratic SVM and an RUS boosted tree ensemble with 100 learners (Figure 3), which performed well individually in early prototyping, despite the large class imbalances.

Finally, the predictions for each epoch were reduced (by mean) to a single prediction for each 10 minute segment (file) and then the segment predictions were combined across the models.

What was your most important insight into the data?

Local cross-validation was difficult in this competition and required an approach that grouped epoch data by segment, to prevent information leakage caused by the same segment being represented in both the training and cross-validation sets. This helped local accuracy a lot, but there was still a relatively large error between local and leaderboard scores to work around. The public leaderboard used only 30% of the test data, so overfitting was a huge risk (the final top ten for this competition had a net position gain of more than 100).

Training-wise ensembling the SVM and RUS boosted tree ensemble had the most significant, above-noise effect on score.

Feature-wise my most valuable insight was to combine features extracted from multiple epoch window lengths, which probably wouldn’t have been possible without having the raw data to work from. Identifying specifically which features were useful was difficult due to the cross-validation noise, but frequency powers, temporal summary statistics, and correlations between channels in both domains were all effective (Figure 2).

Were you surprised by any of your findings?

Two things surprised me, that training a general model rather than multiple patient-specific models worked at all, and that the most predictive frequency power bands were higher than I expected.

Often in seizure detection, models trained on single patients perform better than general models trained on multiple patient’s data. This is partly because there’s large variation between human brains, and partly because there’s not necessarily any correspondence between device channels across patients. These data came from three patients who were all implanted with the same device, with the same channel mapping, but each were implanted in different locations. The model’s predictions do vary noticeably between patients, so it’s clear the models had enough information to identify the patients. It remains to be seen if there’s any advantage in training a general model when testing on totally held out data, or to predict accurately for unseen patients.

Regarding the frequency power bands, my model included “typical” EEG frequency band powers and higher bands up to 200 Hz as features. The bands covering the range 40-150 Hz were more predictive of seizures than the lower frequency bands, which is not what I expected based on the previous UPenn and Mayo Clinic’s seizure detection competition, where Michael Hill’s winning entry used 1-47 Hz in 1 Hz bins. It’s also surprising given the lack of channel correspondence between patients when training a general model - higher frequency signals are more localised than lower frequency signals, so should be more patient-specific.

Figure 3 – Features extracted from different epoch window lengths were joined into one data set that was used to train a quadratic SVM and RUS Boosted tree ensemble. The test set was processed in the same way as the training set and each model produced predictions for each epoch in the test set. The ensembled predictions for each epoch were reduced to create a prediction for each of the files in the test set.

Which tools did you use?

MATLAB 2016b:

Classifier Leaner App
Statistics and Machine Learning Toolbox
Parallel processing toolbox

Although I usually use Kaggle as a way of practicing with Python and R, I stuck with MATLAB in this competition as it’s what I mostly use in my professional work. I also really like the Classifier Learner App to quickly try out different basic models. My code is available here.

How did you spend your time on this competition?

About 70% feature processing, split 50/50 between extraction from the raw data and engineering. The rest of my time was spent on developing more accurate cross-validation and training.

What was the run time for both training and prediction of your winning solution?

On a 4 GHz 4-core i7, around 6-12 hours in total (mostly dependent on how many epoch windows lengths needed to be extracted and combined), with extracting and processing features taking up 80% of the time. Training and predicting from the SVMs (~10 minutes) was very quick, whereas training and predicting from the tree ensembles was slower (30-60 mins, depending mostly on number of cross-validation folds).

Words of Wisdom

Do you have any advice for those just getting started in data science?

It’s important to try and appreciate the difficulties and shortcomings involved in the data collection and experimentation processes. The dataset provided in this competition is remarkable – it’s from chronic implants on human brains! Don’t forget that analysing data is only half the story, a lot of time, effort, and basic science went in to getting hold of it.

More generally, start with online courses on Cousera, Udacity, EdX, etc. but always practice what you’ve learned in real projects, and try and get hold of raw data whenever possible. It’s very important to have a healthy skepticism of all data; and each level of processing inevitably adds mistakes and assumptions that aren’t always obvious.

Just for fun

What is your dream job?

Anything involving neuroscience and machine learning - either using machine learning to guide health decisions, or, conversely, using neuroscience to inform development of machine learning approaches and AI.

Bio

Gareth Jones has a PhD in Neuroscience from The University of Sussex, UK and is currently a post-doc at the UCL Ear Institute, UK. His research uses electrophysiology, psychophysics, and computational modelling to investigate the neural mechanisms of sensory accumulation, multisensory information combination, and decision making.

Santander Product Recommendation Kaggle Competition 2nd Place Winner's Write-Up

The Santander Product Recommendation data science competition where the goal was to predict which new banking products customers were most likely to buy has just ended. After my earlier success in the Facebook recruiting competition I decided to have another go at competitive machine learning by competing with over 2,000 participants. This time I finished 2nd out of 1785 teams! In this post, I’ll explain my approach.

This solution write-up was originally published here by Tom Van de Wiele on his blog and cross-posted on No Free Hunch with his permission.

Overview

This blog post will cover all sections to go from the raw data to the final submissions. Here’s an overview of the different sections. If you want to skip ahead, just click the section title to go there.

The R source code is available on GitHub. This thread on the Kaggle forum discusses the solution on a higher level and is a good place to start if you participated in the challenge.

Introduction

Under their current system, a small number of Santander’s customers receive many recommendations while many others rarely see any resulting in an uneven customer experience. In their second competition, Santander is challenging Kagglers to predict which products their existing customers will use in the next month based on their past behavior and that of similar customers. With a more effective recommendation system in place, Santander can better meet the individual needs of all customers and ensure their satisfaction no matter where they are in life.

The training data consists of nearly 1 million users with monthly historical user and product data between January 2015 and May 2016. User data consists of 24 predictors including the age and income of the users. Product data consists of boolean flags for all 24 products and indicates whether the user owned the product in the respective months. The goal is to predict which new products the 929,615 test users are most likely to buy in June 2016. A product is considered new if it is owned in June 2016 but not in May 2016. The next plot shows that most users in the test data set were already present in the first month of the train data and that a relatively large share of test users contains the first training information in July 2015. Nearly all test users contain monthly data between their first appearance in the train data and the end of the training period (May 2016).

First occurrence of the test users in the training data.

A ranked list of the top seven most likely new products is expected for all users in the test data. The leaderboard score is calculated using the MAP@7 criterion. The total score is the mean of the scores for all users. When no new products are bought, the MAP score is always zero and new products are only added for about 3.51% of the users. This means that the public score is only calculated on about 9800 users and that the perfect score is close to 0.035.

The test data is split randomly between the public and private leaderboard using a 30-70% random split. For those who are not familiar with Kaggle competitions: feedback is given during the competition on the public leaderboard whereas the private leaderboard is used to calculate the final standings.

Exploratory analysis

I wrote an interactive Shiny application to research the raw data. Feel free to explore the data yourself! This interactive analysis revealed many interesting patterns and was the major motivation for many of the base model features. The next plot shows the new product count in the training data for the top 9 products in the training months.

New product counts by time for the top 9 products

The popularity of products evolves over time but there are also yearly seasonal causes that impact the new product counts. June 2015 (left dotted line in the plot above) is especially interesting since it contains a quite different new product distribution (Cco_fin and Reca_fin in particular) compared to the other months, probably because June marks the end of the tax year in Spain. It will turn out later on in the analysis that the June 2015 new product information is by far the best indicator of new products in June 2016, especially because of divergent behavior in the tax product (Reca_fin) and the checking account (cco_fin). The most popular forum post suggests to restrict the modeling effort to new product records in June 2015 to predict June 2016. A crucial insight which changed the landscape of the competition after it was made public by one of the top competitors.

The interactive application also reveals that there is an important relation between the new product probability and the products that were owned in the previous month. The Nomina product is an extreme case: it is only bought if Nom_pens was owned in the previous month or if it is bought together with Nom_pens in the same month. Another interesting insight of the interactive application relates to the products that are frequently bought together. Cno_fin is frequently bought together with Nomina and Nom_pens. Most other new product purchases seem fairly independent. A final application of the interactive application shows the distribution of the continuous and categorical user predictors for users who bought new products in a specific month.

Strategy

A simplification of the overall strategy to generate a single submission is shown below. The final two submissions are ensembles of multiple single submissions with small variations in the base model combination and post-processing logic.

Single submission strategy.

The core elements of my approach are the base models. These are all trained on a single month of data for all 24 products. Each base model consists of an xgboost model of the new product probability, conditional on the absence of the product in the previous month. The base models are trained using all available historical information. This can only be achieved by calculating separate feature files for all months between February 2015 and May 2016. The models trained on February 2015 only use a single lag month whereas the models trained on May 2016 use 16 lag months. Several feature preparation steps are required before the feature files can be generated. Restricting the base models to use only the top features for each lag-product pair speeds up the modeling and evaluation process. The ranked list of features is obtained by combining the feature gain ranks of the 5-fold cross validation on the base models trained using all features. The base model predictions on the test data are combined using a linear combination of the base model predictions. The weights are obtained using public leaderboard information and local validation on May 2016 as well as a correlation study of the base model predictions. Several post-processing steps are applied to the weighted product predictions before generating a ranked list of the most likely June 2016 new products for all test users.

Feature engineering

The feature engineering files are calculated using different lags. The models trained on June 2015 for example are trained on features based on all 24 user data predictors up till and including June 2015 and product information before June 2015. This approach mimics the test data which also contains user data for June 2016. The test features were generated using the most recent months and were based on lag data in order to have similar feature interpretations. Consequently, the model trained on June 2015 which uses 5 lag months is evaluated on the test features calculated on only the lag data starting in January 2016.

Features were added in several iterations. I added similar features based on those that had a strong predictive value in the base models. Most of the valuable features are present in the lag information of previously owned products. I added lagged features of all products at month lags 1 to 6 and 12 and included features of the number of months since the (second) last positive (new product) and negative (dropped product) flanks. The counts of the positive and negative flanks during the entire lag period were also added as features for all products as well as the number of positive/negative flanks for the combination of all products in lags 1 to 6 and 12. An interesting observation was the fact that the income (renta) was non-unique for about 30% of the user base where most duplicates occurred in pairs and groups of size < 10. I assumed that these represented people from the same household and that this information could result in valuable features since people in the same household might show related patterns. Sadly, all the family related features I tried added little value.

I added a boolean flag for users that had data in May 2015 and June 2015 as users that were added after July 2015 showed different purchasing behavior. These features however added little value since the base models were already able to capture this different behavior using the other features. The raw data was always used in its raw form except for the income feature. Here I used the median for the province if it was missing and I also added a flag to indicate that the value was imputed. Categorical features were mapped to numeric features using an intuitive manual reordering for ordinal data and a dummy ordering for nominal data.

Other features were added to incorporate dynamic information in the lag period of the 24 user data predictors. Many of these predictors are however static and added limited value to the overall performance. It would be great to study the impact of changing income on the product purchasing behavior but that was not possible given the static income values in the given data set. I did not include interactions between the most important features and wish that I had after reading the approaches of several of the other top competitors.

Base models

The base models are binary xgboost models for all 24 products and all 16 months that showed positive flanks (February 2015 - May 2016). My main insight here was to use all the available data. This means that the models are trained on all users that did not contain the specific products in the previous months. Initially I was using a “marginal” model to calculate the probability of any positive flank and a “conditional” model to calculate the probability of a product positive flank given at least one positive flank. This results in a way faster fitting process since only about 3 to 4 percent of the users buys at least one product in a specific month but I found that I got slightly better results when modeling using all the data (the “joint” model).

The hyperparameters were decided based on the number of training positive flanks, the more positive flanks I observed in the train data, the deeper the trees.

All models were built using all the train data as well as the remaining data after excluding 10 mutually exclusive random folds. I tried several ways to stack the base model predictions but it seemed that the pattern differences were too variable over time for the value of stacking to kick in compared to a weighted average of the base model predictions.

I also tried to bootstrap the base models but this gave results that were consistently worse. To my current knowledge none of the top competitors got bootstrapping to work in this problem setting.

Base model combination

The base models from the previous section were fit to the test data where the number of used test lags was set to the number of lags in the train data. Most weight was given to June 15 but other months all contained valuable information too although I sometimes set the weight to zero for products where the patterns changed over time (end of this section). To find a good way to combine the base models it was a good idea to look at the data interactively. The second interactive Shiny application compares the base model predictions on the test set for the most important products and also shows other base model related information such as the confidence in the predictions. The following two screenshots give an impression of the application but here I would again like to invite you to have a look at the data interactively.

Base model correlations for cco_fin. June 2015 (Lag 5) and December 2015 (Lag 11) are special months.

Base model predictions comparison for cco_fin using Lag 5 (June 2015) and Lag 11 (December 2015). The Pearson correlation coefficient of the predictions is 0.86.

The interactive base model prediction application made me think of various ways to combine the base model predictions. I tried several weighted transformations but could not find a transformation that consistently worked better with respect to the target criterion than a simple weighted average of the base model predictions. Different weights were used for different products based on the interactive analysis and public leaderboard feedback. The weights of ctma_fin were for example set to 0 prior to October 2015 since the purchasing behavior seemed to obey to different rules prior to that date. Cco_fin showed particularly different behavior in June and December 2015 compared to other months and it seemed like a mixed distribution of typical cco_fin patterns and end of (tax) year specific patterns in those months. The table below shows the relative lag weights for all products. These are all normalized so that they sum to 1 for each product but are easier to interpret in their raw shape below. More recent months typically contribute more since they can model richer dynamics and most weight is given to June 2015. The weights for lags 1 (February 2015) and 2 (March 2015) are set to 0 for all products.

Relative base model weights by product.

Some product positive flanks such as nomina and nom_pens mostly rely on information from the previous lag but product positive flanks like recibo get more confident when more lag information is available. Let’s say for example that recibo was dropped in October 2015 and picked up in November 2015. The June 2015 model would not be able to use this information since only 5 test lag months are used to evaluate the model on the test set (January 2016 - May 2016). In these cases, I adjusted the probability to the probabilities of the models that use more data in some of my submissions. In hindsight, I wish that I had applied it to all submissions. The next plot compares the weighted prediction for May 2016 using the weights from the table above with the adjusted predictions that only incorporate a subset of the lags. The adjusted (purple) predictions are way closer to the predictions using the out-of-bag May 2016 model compared to the weighted approach that uses all base lags and can thus be considered preferable for the studied user.

Base model recibo predictions comparison for user 17211.

Post-processing

Product probability normalization The test probabilities are transformed by elevating them to an exponent such that the sum of the product probabilities matches the extrapolated public leaderboard count. An exponential transformation has the benefit over a linear transformation that it mostly affects low probabilities. Here it was important to realize that the probed public leaderboard scores don’t translate directly into positive leaderboard counts. Products like nomina which are frequently bought together with nom_pens and cno_fin to less extent are thus more probable than their relative MAP contribution.

Confidence incorporation Through a simulation study I could confirm my suspicion that in order to optimize the expected MAP, less confident predictions should be shrunk with respect to more confident predictions. I applied this in some of my submissions and this added limited but significant value to the final ensembles. I calculated confidence as mean(prediction given actual positive flank)/mean(prediction given no positive flank) where prediction was calculated on the out of fold records in the 10-fold cross validation of the base models.

Nomina Nom_pens reordering Nomina is never bought without nom_pens if nom_pens was not owned in the previous month. I simply swapped their predicted probabilities if nomina was ever ranked above nom_pens and if both were not owned in the previous month.

MAP optimization Imagine this situation: cco_fin has a positive flank probability of 0.3 and nomina and nom_pens both have a probability of 0.4 but they always share the same value. All other product probabilities are assumed to be zero. Which one should you rank on top in this situation? cco! The following plot shows a simulation study of the expected MAP where the combined (nomina and nom_pens) probability is set to 0.4 and the single (cco_fin) probability varies between 0.2 and 0.4. The plot agrees with the mathematical derivation which concludes that cco_fin should be ranked as the top product to maximize the expected MAP if its probability is above 0.294. I also closed “gaps” between nomina and nom_pens when the relative probability difference was limited. This MAP optimization had great effects in local validation (~0.2% boost) but limited value on the public leaderboard. It turns out that the effect on the private leaderboard had similar positive value but that I was overfitting, leading me to conclude falsely that MAP optimization had limited value overall.

Mean of MAP@7 using a simulation study for two possible orderings with the combined probability set to 0.4. The single probability varies between 0.2 and 0.4. 200,000 independent simulations are used for each data point.

Ensembling

I submitted two ensembles: one using my last 26 submissions where the weighted probability was calculated with the weights based on the correlation with the other submissions and the public leaderboard score. The second ensemble consisted of a manual selection of 4 of these 26 submissions that were again selected and weighted using their correlation with other submissions and public leaderboard feedback. Details of my submissions are available on GitHub. The next plot shows the public leaderboard score for all final 26 submissions versus the mean mutual rank correlation. I selected the manual subset using this graph and the rank correlation of the submissions in order to make the submissions as uncorrelated as possible while still obtaining good public leaderboard feedback.

Public leaderboard score versus the mean mutual rank correlation for the final 26 submissions. The four indicated submissions in red are used in the first of the two final submissions. The second final submission uses all 26 submissions.

I only got to ensembling on the last day of the competition and only discovered after the deadline that rank averaging typically works better than probability averaging. The main reason is probably the nature of the variations I included in my final submissions since most variations were in the post-processing steps.

Conclusion

The private leaderboard standing below, used to rank the teams, shows the top 30 teams. It was a very close competition on the public leaderboard between the top three teams but idle_speculation was able to generalize better making him a well-deserved winner of the competition. I am very happy with the second spot, especially given the difference between second, third and fourth, but I would be lying if I said that I hadn’t hoped for more for a long time. There was a large gap between first and second for several weeks but this competition lasted a couple of days too long for me to secure the top seat. I managed to make great progress during my first 10 days and could only achieve minor improvements during the last four weeks. Being on top for such a long time tempted me to make small incremental changes where I would only keep those changes if they improved my public score. With a 30-70% public/private leaderboard split this approach is prone to overfitting and in hindsight I wish that I had put more trust in my local validation. Applying trend detection and MAP optimization steps in all submissions would have improved my final score to about 0.03136 but idle_speculation would still have won the contest. I was impressed by the insights of many of the top competitors. You can read more about their approaches on the Kaggle forum.

Private leaderboard score (MAP@7) - idle_speculation stands out from the pack.

Running all steps on my 48GB workstation would take about a week. Generating a ~0.031 private leaderboard score (good for 11th place) could however be achieved in about 90 minutes by focusing on the most important base model features using my feature ranking and only using one model for each product-lag combination. I would suggest to consider only the top 10 features in the base model generation and omit the folds from the model generation if you are mostly interested in the approach rather than the result.

I really enjoyed working on this competition although I didn’t compete as passionately as I did in the Facebook competition. The funny thing is that I would never have participated had I not quit my pilgrimage on the famous Spanish “Camino del Norte” because of food poisoning in… Santander. I initially considered the Santander competition as a great way to keep busy whereas I saw the Facebook competition as a way to change my professional career. Being ahead for a long time also made me a little complacent but the final days on this competition brought back the great feeling of close competition. The numerous challenges at Google DeepMind will probably keep me away from Kaggle for a while but I hope to compete again in a couple of years with a greater toolbox!

I look forward to your comments and suggestions.

Bio

tom_van_de_wiele
Tom Van de Wiele recently completed his master of statistical data analysis at the University of Ghent. Tom has a background in computer science engineering and works as a Research Engineer at Google DeepMind.

Seizure Prediction Kaggle Competition First Place Winners' Interview

The Melbourne University Seizure Prediction competition ran on Kaggle from November to December 2016, attracting nearly 500 teams. Kagglers were challenged to forecast seizures by differentiating between pre-seizure (preictal) and post-seizure (interictal) states in a dataset of intracranial EEG recordings.

In this winners' interview, the first place Team Not-So-Random-Anymore discusses how their simple yet diverse feature sets helped them choose a stable winning ensemble robust to overfitting. Plus, domain experience along with lessons learned from past competitions contributed to the winning approach by Andriy Temko, Alexandre Barachant, Feng Li, and Gilberto Titericz Jr.

The basics

What was your background prior to entering this challenge?

Andriy Temko (AT): I’m an electrical engineer and have PhD in Telecommunications and Signal Processing, previous experience with predictive modelling in acoustic and biomedical signals. This is my first competition win.

Andriy Temko on Kaggle.

Alexandre Barachant (AB): I’m an electrical Engineer and PhD in signal processing. I don’t have an academic training in Machine Learning, but I learned quite a lot thanks to various professional and personal projects. This is not my first kaggle challenge, and I consider myself as a relatively good Data scientist. Saying that, there is still a large part of the field in which i’m not very experienced (I rarely use CNNs)

Alexandre Barachant on Kaggle.

Feng Li (FL): I got my bachelor’s degree in Statistics (Xiamen University) and I’m now pursuing my master’s degree in Data Science (University of Minnesota, Twin Cities). I joined Kaggle one year ago and keep learning from various kinds of competitions.

Feng Li on Kaggle.

Gilberto Titericz Jr (GT): I’m an electrical Engineer and MSC in Telecommunications. Since 2008 I’ve been learning machine learning techniques. In 2012 I joined Kaggle and competed in more than 80 competitions and won a few.

Gilberto Titericz Jr on Kaggle.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

AT: I believe so. Knowledge of EEG processing, some understanding of seizure generation mechanisms, previous successful works on automated seizure detection for newborns.

AB : A bit. I’m a specialist of Brain-Computer interfaces (BCI) and EEG processing. I spent most of my career designing classification algorithms for EEG signals. I’m not very experienced in seizure prediction, but I participated in the two other seizure challenges on Kaggle (not very successfully). Oh yes, I almost forgot to mention that I won 3 other EEG challenge organized on Kaggle Fair enough, I have a lot of domain knowledge and prior experience in this area.

FL: I don’t have prior domain knowledge in EEG data. I just read relevant papers and the solutions in previous seizure competitions on Kaggle.

GT: I don’t have prior domain knowledge in EEG signals. My main experience is in using machine learning to blend models and solve the most diverse problems.

How did you get started competing on Kaggle?

AT: My first competition was a good match to my education and background. I wouldn’t have started otherwise. The current message to others would be – simple things rigorously done can bring you very far, you don’t need to have a lot of prior knowledge to participate.

AB: My first (serious) competition was the DecMeg challenge in 2014. It was pretty much the topic of my first post-doc (cross-subject classification of evoked potential), so I decided to enter and see how well the stuff I developed would do in a competitive environment. Looking back on this one, I was a very inexperienced Kaggler. I just made a few submissions and stopped working. Luckily for me, I nailed an unsupervised procedure that gave me a comfortable advance and I won the challenge with a submission made 45 days before the deadline

FL: I first heard of Kaggle in 2014 and start competing on it since 2015. When I first tried two Kaggle competitions, I found Kaggle is a great platform. People share their original ideas with each other in the forum and the winners’ solutions for each competition always impress me. In every competition I attended, I could always learn something and accumulated my experience on different kind of dataset.

GT: After Google AI challenge 2011 I was looking for other challenges when I found Kaggle. My first competition was Wind Forecasting 2012 and I luckily get a solo third place using a blend of Neural Networks models. Since then I became addicted

What made you decide to enter this competition?

AT: Good chances to win by battling others’ better predictive modelling skills with domain knowledge. I participated in the other seizure detection/prediction challenges ending up in top 10 in both of them. When I saw this one I knew I had good chances to win.

AB: As I mentioned before, I was not very successful in the 2 previous seizure competitions. The first one was after the DecMeg challenge, and I applied a similar strategy. I was on top of the LB for a couple of weeks and then I stopped working on it. I ended up 38th. I guess there is a limit to luck … lesson learned: work hard until the end and team up when you have the occasion!

The second seizure challenge followed shortly after that, so for this one I went all in. I teamed with two other very good Kagglers, and we climbed the public leaderboard up to the second place … and overfitted very badly in the process to end up in 26th place (with a drop of 0.11 AUC, hard to beat that). Thanks to this one, I really learned to fear overfitting!

So when I saw this 3rd seizure competition, I knew it was the right one for me.

FL: I need to finish a one-year long project in order to get my Masters degree. It’s a coincidence to attend this competition because I was looking for a topic which is relevant to machine learning at that time. I noticed this competition is classified as research competition which meets my project requirement. I started doing this competition at the very beginning so I have enough time to read relevant papers and previous solutions.

GT: Get some domain knowledge with the best teammates I could find.

Let’s get technical

How did you spend your time on this competition?

The first difficulty of this challenge was to build a reliable cross-validation procedure. Training and test data were recorded on two different time periods, and it was not possible to emulate this split in the CV. We had to proceed very carefully to avoid overfitting. So very roughly, here is how we spent our time in this challenge: 60% trying to find a proper CV routine. 20% trying to smile when another one was of no use. 20% modeling and feature extraction.

We ended up using multiple CV procedures, trying to gauge relative score improvement of our models at each iteration. We tried to avoid over-tuning parameters of feature extraction and modeling, while maximizing the diversity of the approach in our ensemble of models.

What preprocessing and supervised learning methods did you use?

Since we couldn’t trust our CV score, our approach was to extract as many features as possible, to build as many models as possible and finally to pick the ones that seemed the most robust and diverse. Our team had good EEG domain knowledge, we also reviewed top 10 solutions of the 2 previous seizure challenges as well as the literature. That gave us a solid list of feature to chose from (FFT, correlation, coherence, AR error coefficient, etc …)

We then trained a bunch of different classifiers (XGB, SVM, KNN, LR). We limited ourselves to simple solutions with minimal parameter tuning. In this regard, Gilberto’s experience in guessing (almost-optimal) XGB parameters from the first iteration was a big deal.

What was your most important insight into the data?

The dataset was small, noisy, and CV was unpredictable and unreliable - It was a very challenging problem. So here is our more important insight: Diversity in the ensemble is the key to robustness. We use many simple and relatively low performing models rather than trying to hyper-optimize our best performing models (and overfit in the process). Simple feature sets works very well in this dataset. On the contrary, the performance drops down when we include more complex features sets.

Figure 1: Correlation map between each of the 11 individual models and the winning solution. The overall low correlation show a strong diversity in the models predictions.

The second most important insight came from experience when we had to pick our 2 final submissions. We believed the public leaderboard to be very overfitted (including our best public score) and that the private score will settle down around 0.8 AUC. We decided to minimize our risk by choosing as one of our final submissions a very conservative ensemble, tailored to be stable. It was a difficult choice, the public score of this ensemble was around 0.81 while our best score was 0.85. Luckily enough, our prediction was right and everyone went below 0.8 AUC on the private LB except for our stable ensemble that barely moved.

Which tools did you use?

Python, R, Matlab. All with various modules/toolboxes.

What was the run time for both training and prediction of your winning solution?

Ranging from a few hours to a few days (mainly due to the usage of Matlab and old toolboxes).

Words of wisdom

When CV is unreliable, don’t panic, simple things and basic ensembling (and teaming) provide a very stable solution.

Teamwork

How did your team form?

Contacting people by identifying those whom you can learn from and who can help the team to win.

How did your team work together?

Lots of chit-chatting, brainstorming, “a problem shared is a problem halved”.

How did competing on a team help you succeed?

Teaming brings together much more human resources and brainpower to solve a problem at hand. Eventually, it allows for a combination of many different solutions/directions that one alone would not have thought of. Last but not least, it allows to keep a strong motivation even when you are stuck and nothing seems to work.

Bios

Alexandre Barachant is a French Researcher, expert in Brain computer interfacing and Biosignal analysis. He received his Ph.D. degree in signal processing in 2012 from the Grenoble Alpes University, France. Since then, he has been a post-doc fellow at the Centre National de la Recherche Scientiﬁque (CNRS) in the GIPSA-lab Laboratory, Grenoble, France and at the Burke Medical Research Institute, Cornell University, New York. His research interests include statistical signal processing, machine learning, Riemannian geometry and classification of neurophysiological recordings.

Andriy Temko received his PhD degree in Telecommunication in 2008 from Universitat Politècnica de Catalunya, Barcelona, Spain. Since late 2008 he has been with the Irish Centre for Fetal and Neonatal Translational Research, University College Cork, Ireland, working on algorithms for the detection of brain injuries in the newborn. He is the author of more than 70 peer-reviewed publications and 2 patents. He developed and patented a novel cutting-edge neonatal seizure detection system which is currently undergoing European multi-center clinical trial towards its regulatory approval and clinical adoption. His research interests include acoustic and physiological signal processing, clinical decision support tools, and applications of machine learning for signal processing.

Feng Li is a Data Science program student in the University of Minnesota, Twin Cities.

Gilberto Titericz, Jr

is an electronics engineer with a M.S. in telecommunications. For the past 16 years he's been working as an engineer for big multinationals like Siemens and Nokia and later as an automation engineer for Petrobras Brazil. His main interests are in machine learning and electronics areas.

The Santander Product Recommendation competition ran on Kaggle from October to December 2016. Over 2,000 Kagglers competed to predict which products Santander customers were most likely to purchase based on historical data. With his pure XGBoost approach and just 8GB of RAM, Ryuji Sakata (AKA Jack (Japan)), earned his second solo gold finish by coming in 3rd place. He simplified the problem by breaking it down into several binary classification models, one for each product. Read on to learn how he dealt with unusual temporal patterns in the dataset in this competition where feature engineering was key.

The basics

What was your background prior to entering this challenge?

My university degree is in Aeronautics and Astronautics and I researched reliability engineering. There, I studied probability theory and statistics especially. Currently, I work for Panasonic Group as a data scientist for about 4 years, but I didn't have any knowledge of machine learning until starting my current work. Almost all of my knowledge of machine learning is based on my experiences from Kaggle competitions.

Ryuji Sakata (Jack (Japan)) on Kaggle.

How did you get started competing on Kaggle?

I joined Kaggle about three years ago in order to learn machine learning through practice. Now, I always want to enjoy Kaggle competitions when I have spare time.

What made you decide to enter this competition?

Before the launch of this competition, there was no running competition I could enter mainly because of data size. I have only 8GB laptop and it limits my participation in competitions. However, this competition allows me to compete with other Kagglers by using my own machine, and that’s why I entered.

Let's get technical

What was your most important insight into the data?

I inspected new purchase trends of each product, and I found that 2 specific products, cco_fin and reca_fin, had unusual trends (Figure A). Due to these unusual trends, to predict the new purchase of Jun 2016, I decided that cco_fin and reca_fin should be trained by data from different months compared to other products. Therefore, I decided to train models of each product separately by using different training data for each product rather than building just one model. (I ignored the peak of nom_pens of June because the peak of February was not periodic.)

Figure A.

What preprocessing and supervised learning methods did you use?

In this competition, extracting information from past purchase history of customers was very important. I made features as listed below:

ind_(xyz)_ult1_last: the last month index of the product (lag-1)
ind_(xyz)_ult1_00: the number of transition of index from 0 to 0 until last month
ind_(xyz)_ult1_01: the number of transition of index from 0 to 1 until last month
ind_(xyz)_ult1_10: the number of transition of index from 1 to 0 until last month
ind_(xyz)_ult1_11: the number of transition of index from 1 to 1 until last month
ind_(xyz)_ult1_0len: the length of consecutive 0 index until last month
products_last: concatenation of last month indices of products
n_products_last: the number of products purchased last month

Some of these are shown in the figures below. The feature products_last is not numeric, so it can’t be handled by XGBoost directly. It was replaced with numeric by mean value of the target variable (the height of each bar in the figure C).

Figure B.

Figure C.

The overview of training and ensemble is illustrated in the figure below. The training method I used is XGBoost only, and models of each product were trained separately as binary classification tasks. To ensemble predictions from different train data, they were normalized so that sum of probabilities of the 18 products became 1. After the normalization, multiple predictions of each product are log-averaged. Then, probabilities of all products were merged and the top 7 products were elected to make a submission.

Figure D.

Which tools did you use?

I used the R language including the packages data.table, dplyr and xgboost. I would like to master Python too in future.

What was the run time for both training and prediction of your winning solution?

The number of training process is 128 (18 products * 7 times + 2 products * 1 time).
Each training process took about 10 minutes, so the total estimated execution time is about 1280 minutes = 21 hours. Each prediction process took 1 minute or less, so the total execution time is about 2 hours.

Words of wisdom

What have you taken away from this competition?

I realize the importance of feature engineering through this competition. I think that one of the turning points of the game was how much information we could extract from the data rather than training methods or parameter tuning. It is worthwhile to take much time, I believe.

Do you have any advice for those just getting started in data science?

Let’s Kaggle together!

Bio

Ryuji Sakata works for Panasonic Group as a data scientist. He has been involved in data science for about 4 years. He holds a master's degree in Aeronautics and Astronautics from Kyoto University.

Background

I have MSc in computer science and work as a software engineer at Evil Martians.

Alexey on Kaggle.

I became interested in data science about 4 years ago - first I watched Andrew Ng’s famous course, then some others, but I lacked experience with real problems and struggled to get some. But things changed when around beginning of 2015 I got to know Kaggle, which seem to be the missing piece, as it allowed me to get experience in complex problems and learn from the others, improving my data science and machine learning skills.

So, for two years already I’ve participated in Kaggle competitions as much as I can, and it’s one of the most fun and productive pursuits I’ve had.

I noticed this competition during the end of Bosch Production Line Performance, and I became interested in it because of the moderate data size and mangled data, so I can focus on general methods of building and improving models. So I entered it as soon as I got some time.

Data preprocessing and feature engineering

First, I needed to fix skew in target variable. Initially I applied log-transform, and it worked good enough, but some time after I switched to other transformations like log(loss + 200) or loss ^ 0.25, which worked somewhat better.

Target variable and its transformations.

For features - first of all, I needed to encode categorical variables. For this I used basic one-hot encoding for some models, but also so-called lexical encoding, when value of encoded category is produced from its name (A becomes 0, B - 1, Z - 26, AA - 27, and so on).

I tried to find some meaningful features, but had no success at it. Also there were some kernels which provided insights into the nature of some variables, and tried to de-mangle them but I couldn’t get any improvement from it. So, I switched to using general automated methods.

The first of such methods was, of course, SVD, which I’ve applied to numerical variables and one-hot encoded categorical features. It helped to improve some high-variance models, like FM and NN.

Second, and more complex, was clustering the data and creating a new set of features based on the distance to cluster centers (i.e., applying RBF to them) - it helped to create a bunch of unsupervised non-linear features, which helped to improve most of my models.

And third, the last trick I used was forming categorical interaction features, applying lexical encoding to them. These combinations may be easily extracted from XGBoost models by just trying the most important categorical features, or better, analysing the model dump with the excellent Xgbfi tool.

First-level models

Based on these features, I built a lot of different models which I evaluated using the usual k-fold cross-validation.

First of all, there was linear regression, which gave me about 1237.43406 CV / 1223.28163 LB score, which is not very much of course, but provides some baseline. But after adding cluster features to it, it became 1202.70592 CV / 1189.64998 LB, which is much better for such a simple model.

Then, I tried scikit-learn RandomForestRegressor and ExtraTreesRegressor models, of which random forest was best, giving 1199.82233 CV / 1176.44433 LB after some tuning, and improved to 1186.23675 CV / 1166.85340 LB after adding categorical feature combinations. One problem with this model was that however scikit-learn supports MAE loss, it’s very slow and impossible to use, so I needed to use basic MSE, which has some bias in this competition.

The best model of scikit-learn which helped me was GradientBoostingRegressor, which was able to directly optimize MAE loss and gave me 1151.11060 CV / 1126.30971 LB

I also tried LibFM model, which gave me 1196.11333 CV / 1155.68632 LB in a basic version and 1177.69251 CV / 1150.37290 LB after adding cluster features to it.

But the main workhorses of this competitions were, of course, XGBoost and neural net models:

In the beginning, my XGBoost models provided something about 1133.00048 CV / 1112.86570 LB, but then I’ve applied some tricks which improved it to 1122.64977 CV / 1105.43686 LB:

Averaging multiple runs of XGBoost with different seeds - it helps to reduce model variance;
Adding categorical combination features;
Modifying objective function to be closer to MAE;
Tuning model parameters - I didn’t have much experience with it before, so this thread in Kaggle forums helped me a lot.

Custom objective function for XGBoost.

The other model that provided great results was neural net, implemented using the Keras library. I used basic multi-layer perceptron with 3 hidden layers which gave me somewhat about 1134.92794 CV / 1116.44915 LB in initial versions and improved to 1130.29286 CV / 1110.69527 LB after tuning and applying some tricks:

Averaging multiple runs, again;
Applying exponential moving average to weights of single network, using this implementation;
Adding SVD and cluster features;
Adding batch normalization and dropout;

Model tuning

In this competition, model hyperparameter tuning was very important, so I’ve contributed a lot of time in it. There are three main approaches here:

Manual tuning, which works good when you have some intuition about parameter behaviour and may estimate model performance before its training completes by per-epoch validation scores;
Uninformed parameter search - using GridSearchCV or RandomizedSearch from sklearn package, or similar - most simple of all;
Informed search using HyperOpt or BayesOptimization or similar package - it tries to fit some model to scores of different parameter sets and selects the most promising point for each next try - so it usually finds the optimum a lot faster than uninformed search.

I used manual tuning for XGBoost and NN models which provide per-epoch validation scores and bayes optimization package for the others.

Second level

After getting a lot of models, I combined them in the second level, training new models on out-of-fold predictions:

Linear regression, which gave me 1118.45564 CV / 1113.08059 LB score
XGBoost - 1118.16984 CV / 1100.50998 LB
Neural net - 1116.40752 CV / 1098.91721 LB (it was enough to get top-16 in public, and top-8 in private)
Gradient boosting - 1117.41247 CV / 1099.60251 LB

I haven’t had much experience with stacking before and so I was really impressed by these results, but wanted to get even more.

So, the first thing I did was correct the bias of some stacked models - linear regression and XGBoost optimized some objective which was not equal to the objective of the competition, which resulted in overestimating low values and underestimating high ones. This bias is really small, but the competition was very close, so every digit counted.

This bias can be seen on the next figure where logs of XGBoost predictions are plotted against target logs with and a median regression line. If it was unbiased, the median regression should be the same as diagonal, but it's not (offset is most visible where red arrows are).

XGBoost bias.

I took XGBoost predictions to some small power p (around 1.03), normalized to preserve its median, and it improved my score to 1117.35084 CV / 1099.63060 LB.

Not bad, but maybe I can get even more by combining predictions of these models?

Third level

So, I built a third level. As each new stacking level becomes more and more unstable, I needed something really simple here, which may optimize the competition’s metric directly. So, I chose to use median regression from the statsmodels package.

The main problem of this approach was lack of regularization, so it wasn’t very stable and had a lot of noise. To fight it I applied some tricks:

Training model on many subsamples and averaging predictions;
Reducing input dimensionality - grouping similar models of previous layers and using group averages as features;
Averaging best 10 submissions for a final one.

It allowed me to get 1098.07061 in public LB, and 1110.01364 in private, which corresponds to second place.

Final pipeline.

Lessons learned

So, this competitions helped me a lot, mainly in two main areas in which I lacked experience in before - model hyperparameter tuning (especially for XGBoost, where I got good hyperparameter intuition) and stacking, which I underestimated before.

Also, I tried a lot of different models and got intuition about how they work with different target, feature transformations, and so on.

Bio

Alexey Noskov is a Ruby and Scala developer at Evil Martians.

More from Alexey

Alexey shares more details on his winning approach on the competition’s forums including his winning competition code on GitHub.

Leaf Classification playground competition: Winners Kaggle Kernels

The Leaf Classification playground competition ran on Kaggle from August 2016 to February 2017. Over 1,500 Kagglers competed to accurately identify 99 different species of plants based on a dataset of leaf images and pre-extracted features. Because our playground competitions are designed using publicly available datasets, the real winners in this competition were the authors of impressive kernels.

In these sets of mini-interviews, you'll learn:

why you shouldn't always jump straight to XGBoost;

what makes feature engineering in Kernels more exciting than any MOOC; and

how to interpret visualizations of PCA and k-means, two of the most common unsupervised learning algorithms

Read on or click the links below to jump to a section.

Data Exploration

10 Classifier Showdown

Created by: Jeff Delaney
Language: Python

What motivated you to create it?

Because the Leaf Classification dataset is small, I wanted a script that could run a bunch of different classifiers in a single pass. My goal was to provide a basic starting point for the competition and showcase the many classification algorithms in Scikit-Learn. It’s tempting to go straight to XGBoost for a competition like this, so it was important to point out that simple linear algorithms can be powerful as well.

Comparing classifier accuracy and logloss in the kernel 10 Classifier Showdown in Scikit-Learn.

What did you learn from your analysis?

I learned that Linear Discriminant Analysis was an effective algorithm for this problem, which is not something I would normally expect or even be looking for. However, it makes sense given that the leaves all follow distinct geometric patterns.

3 Basic Classifiers and Feature Correlation

Created by: Federico C
Language: Python

What motivated you to create it?

The Leaf Classification competition was my first attempt at a Kaggle challenge. As a beginner in the field, I started looking around for tips and suggestions among the kernels from the Kaggle community and decided to put everything I learned together for future reference.

As a first step, I thus chose to write my own notebook, where I tested a number of methods on the given dataset and identified their strengths and weaknesses. This helped me understand the logic behind each step, get experience on how different classifiers work and clarified why each algorithm may be more or less suitable for different kind of datasets.

Hopefully, the notebook also helped other participants who, like me, were looking for a good starting point.

What did you learn from your analysis?

In this analysis, I focused on the three main classifiers: Naive Bayes, Random Forest and Logistic Regression. Each one of them has specific pros and cons depending on the structure of the dataset at hand and should be carefully considered.

How to set up basic classifiers in the kernel 3 Basic Classifiers and Feature Correlation.

A necessary first step is to dive into the data, explore and prepare them, removing superfluous or redundant information and reconstructing (where possible) missing data. The Leaf Classification is a good dataset from this point of view, it offers a wide range of features, some of which are correlated and can be reshaped in a number of ways. This stage can strongly affect the final score of your algorithm, as exemplified clearly in the case of Naive Bayes, where making sure that the assumption of uncorrelated features is respected makes all the difference in the final outcome. With Random Forest and Logistic Regression, features reduction on this dataset has a negligible impact.

Can you tell us about your approach in the competition?

I approached the competition mainly with the intention to learn and understand the fundamental concepts of classification algorithms of machine learning. Thus, instead of choosing a single algorithm and optimise it, I decided to systematically apply different classifiers to the dataset prior and after treating it with PCA for features reduction.

Feature Engineering

Keras ConvNet w/ Visualization

Created by: Abhijeet Mulgund
Language: Python

What motivated you to create it?

Compared to a lot of competitors, I’m fairly new to the Kaggle. I only joined last summer after an older friend of mine pointed me towards Kaggle while we talked about machine learning and courses. After dabbling in a few featured competitions, I moved to the Leaf Classification competition with 2 main goals: to score a <0.01 logloss, and to become involved in the Kernels and Discussions. Once I had achieved the former, I realized I had a solution based off a fairly simple idea, merging image data with Kaggle’s pre-extracted features, which would be perfect for people who might be stuck or interested in new ideas, making this solution the perfect way for me to get involved with Kaggle Kernels.

What did you learn from your analysis?

From my analysis I came up with one conclusion and one (untested) hypothesis. After examining the data through several of the data exploration kernels and posts, I noticed that data augmentation would help my neural net learn a lot of the invariances in the data like rotational, scale, and orientation invariance (I didn’t need to worry about shifting because I centered the images in my pre-processing phase). So I added data augmentation through keras to try to compensate for the size of the dataset. I found that some data augmentation did give a noticeable boost to my combined image and pre-extracted features model, but if I made the data augmentation too aggressive, it would do more harm than good.

Visualizing different convolutional layers in the kernel Keras Convnet with Visualizations.

When I first tried out my idea of merging the image and pre-extracted data, I was actually very surprised it even worked. I was fairly new to convolutional neural nets and was not quite sure why simply feeding the network the pre-extracted features produced such beautiful looking filters while trying to train purely on images produced garbage filters. Even with data augmentation, I could not get the pure image CNN to learn anything. After publishing my kernel and learning more about CNNs, I came up with a hypothesis. I think passing the pre-extracted features helps stabilize the learning a little bit. With randomly initialized weights trained from solely high-dimensional images, the CNN will have trouble learning anything. But the inclusion of the pre-extracted features helps stabilize the learning and helps with convergence during gradient descent. If this is in fact true, one consequence I hope to explore is whether initializing the convolutional weights of a pure image model with the convolutional layer weights of the combined model will give the pure image a nice set of filters that might allow it to learn during training.

Can you tell us about your approach in the competition?

I started this competition by simply feeding the pre-extracted features into a multi-layer perceptron with one hidden layer and got surprisingly good results, but I still had all this image data that I wasn’t using. My immediate thought then was to simply combine a convolutional neural network on the images with the pre-extracted features MLP and train the entire model end to end. To make up for the small size of the dataset I also threw data augmentation into the mix. Luckily, Keras already had built in image data augmentation for me to take advantage of.

Feature Extraction from Images

Created by: Lorinc
Language: Python

What motivated you to create it?

Even the best linear algebra MOOC puts me in sleep in seconds. Only working on something gives me enough reason to understand and comprehend the underlying principles. Therefore I often take on challenges I’m clearly not entitled to, and through countless hours of frustration (that I secretly enjoy a lot) I learn the subject. This time, I wanted to do the full lifecycle of a machine learning project, without relying too much on external libraries. And while I got way further than this notebook, I could not even finish the feature extraction, because I landed a dream job.

A step-by-step guide for extracting features from shapes by turning them into time-series in the kernel Feature Extraction from Images.

What did you learn from your analysis?

I never went deeper into any subject in my life, and I still know literally nothing. This business-maths-programming domain is not only huge, it is also infinitely deep. You can pick any niche topic in it and spend a lifetime mastering it. As I was sacrificing months of sleep reading up stuff at Wikipedia, I have learnt that there is an abyss of knowledge below the surface I considered as my universe until now. I wish someone opened this door for me when I was 13.

People underestimate the value of storytelling in data science. Magnitudes more valuable notebooks go down the drain unnoticed, because they are just a block of code. No story, no visuals. At the end, data scientists are just merchants, trying to sell their truths to their clients. Truth does not sell itself, take pride being a good merchant. For the greater good, of course.

https://github.com/lorinc/kaggle-notebooks/blob/master/extracted_leaf_shape.png

Insightful Visualizations

Visualization PCA and Visualizing k-means

Created by: Selfish Gene
Language: Python

What motivated you to create it?

Until recently I was a heavy Matlab user, and have a lot of code accumulated in Matlab over the past several years.

It’s funny but my main motivation for creating these two scripts was to force myself to translate the two extremely useful (for me personally) classes of GaussianModel and KmeansModel from Matlab to python.

Since Matlab is becoming less and less useful and python is already a much better tool in almost every aspect, I’m trying to completely back away from Matlab, and work exclusively with python.

Visualize "distance from cluster centers" feature space in the kernel Visualizing k-Means.

Visualizing "distance from cluster centers" feature space in the kernel Visualizing k-Means.

Another motivation was to illustrate how the results of the two simplest unsupervised algorithms, PCA and k-means, can be interpreted and visualized. I think it gives a lot of intuition about machine learning algorithms in general as it helps us understand how to think about objects (in this case, images) as points in a high dimensional space and understand what inner products in these high dimensional spaces actually mean.

How the leaf images vary around the mean image from the kernel Visualizing PCA.

What did you learn from your analysis?

I think applying basic visualization methods on data always helps understand how the data behaves and what are the main sources of variability in it. For example, I feel now that I understand a little bit more about what leaf shapes look like in real life.

Inspired to get started in computer vision? I recommend checking out our Digit Recognition getting started competition. You'll learn fundamentals of using machine learning to work with image data to classify handwritten digits in the famed MNIST dataset. We've curated a set of tutorials, but if you're brand new to Kernels, you can learn more here.

Outbrain Click Prediction Kaggle Competition 2nd Place Winners' Interview

From October 2016 to January 2017, the Outbrain Click Prediction competition challenged Kagglers to navigate a huge dataset of personalized website content recommendations with billions of data points to predict which links users would click on.

In this winners' interview, team brain-afk shares a deep dive into their second place strategy in this competition where heavy feature engineering gave a competitive edge over stacking methods. Darragh, Marios (KazAnova), Mathias (Faron), and Alexey describe how they combined a rich set of features with Field Aware Factorization Machines including a customized implementation to optimize for speed and memory consumption

The Basics

What was your background prior to entering this challenge?

Darragh Hanley: I am a part time OMSCS student at Georgia Tech and a data scientist at Optum, using AI to improve healthcare.

Darragh on Kaggle.

Marios Michailidis: I am a Part-Time PhD student at UCL, data science manager at dunnhumby and fervent Kaggler.

Marios (KazAnova) on Kaggle.

Mathias Müller: I have a Master’s in computer science (focus areas cognitive robotics and AI) and I’m working as a machine learning engineer at FSD.

Mathias (Faron) on Kaggle.

Alexey Noskov: I have an MSc in computer science and work as a software engineer at Evil Martians.

Alexey on Kaggle.

How did you get started with Kaggle?

Darragh Hanley: I saw Kaggle as a good way to practice real world ML problems.

Marios Michailidis: I wanted a new challenge and learn from the best.

Mathias Müller: Kaggle was the best hit for “ML online competitions”.

Alexey Noskov: I became interested in data science about 4 years ago - first I watched Andrew Ng’s famous course, then some others, but I lacked experience with real problems and struggled to get some. But things changed when around beginning of 2015 I got to know Kaggle, which seem to be the missing piece, as it allowed me to get experience in complex problems and learn from the others, improving my data science and machine learning skills.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

Marios Michailidis: My work in dunnhumby as well as my PhD are focused in the recommendation space and specifically personalization.

Darragh Hanley: My job is currently focused on user personalization within the health industry, so I have had time to research the latest advances in the area. We are actively building out our Machine Learning and Big Data departments at Optum, Dublin, so reach out if you have an interest in joining.

General: some of us have experience in the recommendation space, however not specifically in the ad-click domain. Nevertheless we have collectively participated in many other data science challenges from Kaggle (such as the Avito competitions) with similar concepts.

A recommenders’ challenge

In this data challenge our team was tasked to predict which pieces of (anonymized) websites' contents the users of a web’s leading content discovery platform were likely to be clicked on.

The prediction had to be made using billions of historical data on-site behavior such as streams of page visits and clicks from multiple publisher sites in the United States between 14-June-2016 and 28-June-2016 as well as general information about the displayed (promoted) content.

Given a training set of 87 million Displays and Ads pairs, our team had to maximize mean average precision of @12 (MAP@12) for an unobserved test dataset with 32 million pairs.

The backbones of our solution are heavy feature engineering paired with the usage of “Field Aware Factorization Machines” (FFM). For the latter we used the original LibFFM as well as a custom implementation by Alexey optimized regarding speed and memory consumption.

Our general approach could be summarized with the following steps:

Setting up a reliable cross validation framework
Feature engineering
Generating strong single models (particularly FFM)
Stacking

Let’s get technical

Cross validation

Around 50% of the test data pairs were derived from days included in the training data. The latter 50% was including two future days. Subsequently, based on this information we tried to mimic that relationship via constructing a new (smaller) set of training and validation in order to test our models via cross-validation.

Although the training and test data had enough information in common, in terms of the general content being displayed, the actual users did not seem to overlap much between the 2 sets. This can also be illustrated below in the bottom right pie-chart. This fact further boosted our strategy for making our cross validation procedure very similar to the actual testing process.

Figure 1: Small overlap of users between train and test data.

As a result, we constructed a training data set of 14 million pairs and a validation data set of 6 million pairs. We sampled it according to the structure of the given test set which connoted a distribution of pairs where half were extracted from (2) future days and the rest from days present in the 14 million sample. The ratio of train-to-test remained stable, the smaller training set allowed for faster and more efficient validation, while the significant size of the data still ensured reliable results.

Apart from MAP@12 we were also recording log loss and AUC values, because the latter metrics can be a bit more informative for smaller in-model changes.

Feature Engineering

Feature engineering turned out to be the most important aspect in this competition in order to achieve a competitive score.

For use in FFM all features were hashed. Obvious features such as region, document source, publisher etc of the document provided uplift and can be seen in many of the public kernels.

Figure 2: Feature Hashing

On the public kernels a feature was published by Kaggle user rcarson indicating ads clicked by users in the page_views file, which were also present in the clicks train and test files. This feature was very strong bringing approximately .02 MAP@12 uplift. This was further improved via bucketing it into rows where the page_views doc was before, 1 hour after, 1 day after and > 1 day after the display timestamp.

It was beneficial to include competing ads - that is, all ads for a given display_id - on each row, so that the learner knew which were the alternative ads presented to the user for a particular ad choice. For this we hashed each individual competing ad as a feature. See the relevant figure below:

Figure 3: Competing ads per display id was predictive

We also hashed all combinations of the document/traffic_source clicked by a user in page_views. So, if a user came to a document from ‘search’, it would be treated differently to a user coming to the same document from ‘internal’. Any documents occurring less than 80 times in events.csv were dropped, because sparse documents tended to just add noise to the model. This information was quite strong in the model as it brought user level preferences in clicks.

We used document source_id as a means to aggregate the sparse documents and include them in the model in order to recover some of the lost information after filtering out documents, which occurred less than 80 times in events.csv. This was achieved by hashing the source_ids (from documents_meta) of all the page view documents for each user. We still excluded source_ids which occurred less than 80 times in events.csv, however there were much fewer cases now. An illustration of this can be seen in Figure 4 below.

Figure 4: <code>document_id</code> had very low occurring values, whereas we could capture the page view with the <code>source_id</code>

Figure 4: document_id had very low occurring values, whereas we could capture the page view with the source_id

Another useful feature was found via searching through page_views for documents clicked within one hour after the train and test displayed documents for a given user. Each document was hashed as a new feature.

As per usual, counts of category levels were important. Particularly counts of ads per display (see Figure 5) as this would have a high impact on the likelihood of a click:

Figure 5: large range of ad counts over different displays

We also computed simple counts of ads, documents, document sources etc. as well as their corresponding counts after conditioning on time (past versus future counts).

Besides, we obtained several flags to indicate, if the user ever viewed ad documents of similar category or similar topic and if the user viewed or clicked the given ad in the past.

We judged the quality of new features based on local CV as well as public LB scores.

Base Modelling

We trained around 50 different base models for which we varied the model parameters and input features to increase the diversity. Our 6M validation set has been used to create “out-of-fold” predictions and hence it became the training set at the Meta level.

Models used at level 1

Custom version of FFM by Alexey
LibFFM
LibFM
FTRL
Liblinear
XGBoost
Keras
Vopwal Wabbit models
Logistic regression
SVC

Our best single model (FFM) scored 0.70030 at the public and 0.70056 at the private leaderboard and hence would have placed 2nd on its own.

2nd-Level Stacking

The level 1 model predictions were stacked by training XGBoost and Keras models on the 6M at once. Hence, we didn’t performed common and future days separation, which other competitors reported to be an useful step. Instead, we used normalized time as an additional feature at level 2 to provide the valuable timeseries’ attribute of this dataset to the stacker models.

In a final step, the XGB & Keras level 2 predictions have been blended by use of weighted geometric mean: XGB0.7 * NN0.3, where the weights were chosen intuitively guided by MAP@12 scores on a 20% random subset of the 6M set as well as on the public LB.

Our final submission for this competition scored 0.70110 at the public and 0.70144 at the private leaderboard.

Bios

Mathias Müller (Faron) is a machine learning engineer for FSD Fahrzeugsystemdaten. He has a Master’s in Computer Science from the Humboldt University of Berlin. His thesis was about ‘Bio-Inspired Visual Navigation of Flying Robots’.

Alexey Noskov (alexeynoskov) is a Ruby and Scala developer at Evil Martians.

Brief intro

I am an MSc student in Data Analysis at Skoltech, Moscow. I joined Kaggle about a year ago when I attended my first ML course at university. The first competition was What’s Cooking. Since that, I’ve participated in several Kaggle competitions, but didn’t pay so much attention to it. It was more like a bit of practice to understand how ML approaches work.

Ivan Sosnovik on Kaggle.

The idea of Leaf Classification was very simple and challenging. Seemed like I wouldn’t have to stack so many models and the solution could be elegant. Moreover, the total volume of data was just 100+ Mb and the process of learning could be performed even with a laptop. It was very promising because the majority of the computations was supposed to be done on my MacBook Air with 1,3 GHz Intel Core i5 and 4 Gb RAM.

I have worked with black-and-white images before. And there is a forest near my house. However, it didn’t give me so much profit in this competition.

Let’s get technical

When I joined the competition, several kernels with top 20% scores were published. The solutions used the initially extracted features and Logistic Regression. It gave $logloss \approx 0.03818$ . And no significant improvement could be achieved by tuning of the parameters. In order to enhance the quality, feature engineering had to be performed. Seemed like no one had done it because the top solution had slightly better score than mine.

Feature engineering

I did first things first and plotted the images for each of the classes.

10 images from the train set for each of 7 randomly chosen classes.

The raw images had different resolution, rotation, aspect ratio, width, and height. However, the variation of each of the parameters within the class is less than between the classes. Therefore, some informative features could be constructed just on the fly. They are:

width and height
aspect ration: width / height
square: width * height
is orientation horizontal: int(width > height)

Another very useful feature that seemed to help is the average value of the pixels of the image.

I added these features to the already extracted ones. Logistic regression enhanced the result. However, most of the work was yet to be done.

All of the above-described features represent nothing about the content of the image.

PCA

Despite the success of neural networks as feature extractors, I still like PCA. It is simple and allows one to get the useful representation of the image in ${\rm I\!R}^N$ . First of all, the images were rescaled to the size of $50 \times 50$ . Then PCA was applied. The components were added to the set of previously extracted features.

Eigenvalues of the covariance matrix.

The number of components was varied. Finally, I used N=35 principle components. This approach showed $logloss \approx 0.01511$ .

Moments and hull

In order to generate even more features, I used OpenCV. There is a great tutorial on how to get the moments and hull of the image. I also added some pairwise multiplication of several features.

The final set of features is the following:

Initial features
height, width, ratio etc.
PCA
Moments

The Logistic Regression demonstrated $logloss \approx 0.00686$ .

The main idea

All of the above-described demonstrated good result. Such result would be appropriate for real life application. However, it could be enhanced.

Uncertainty

The majority of objects had certain decision: there was the only class with $p \sim 1.0$ and the rest had $p \lesssim 0.01$ . However, I found several objects with uncertainty in a prediction like this:

Prediction of logistic regression.

The set of confusion classes was small (15 classes divided into several subgroups), so I decided to look at the pictures of the leaves and check if I can classify them. Here is the result:

Quercus confusion group.

Eucalyptus and Cornus confusion group.

I must admit that Quercus’ (Oak) leaves look almost the same for different subspecies. I assume, that I could distinguish Eucalyptus from Cornus, but the classification of subspecies seems complicated to me.

Can you really see the random forest for the leaves?

The key idea of my solution was to create another classifier, which will make predictions only for confusion classes. The first one I tried was RandomForestClassifier from sklearn and it gave excellent result after the tuning of hyperparameters. The random forest was trained on the same data as logistic regression, but only the objects from confusion classes were used.

If logistic regression gave uncertain predictions for an object then the prediction of the random forest classifier was used. Random forest gave the probabilities for 15 classes, the rest assumed to be absolute .

The final pipeline is the following:

Final pipeline.

Threshold

The leaderboard score was calculated on the whole dataset. That is why some risky approaches could be used in this competition.

Submissions are evaluated using the multi-class logloss.
$logloss = - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij}\log(p_{ij})$
where N,M - number of objects and classes respectively, $p_{ij}$ is the prediction and $y_{ij}$ is the indicator: $y_{ij} = 1$ if object is in class , otherwise it equals to . If the model correctly chose the class, then the following approach will decrease the overall logloss, otherwise, it will increase dramatically.

After thresholding, I got the score of logloss = 0.0 . That’s it. All the labels are correct.

What else?

I’ve tried several methods that showed appropriate result but was not used in the final pipeline. Moreover, I had some ideas on how to make the solution more elegant. In this section, I’ll try to discuss them.

XGBoost

XGBoost by dmlc is a great tool. I’ve used it in several competitions before and decided to train it on the initially extracted features. It demonstrated the same score as logistic regression or even worse, but the time consumption was a way bigger.

Submission blending

Before I came up with an idea of Random Forest as the second classifier, I tried different one-model methods. Therefore I collected lots of submissions. The trivial idea is to blend the submissions: to use the mean of the predictions or weighted mean. The result did not impress me either.

Neural networks

Neural networks were one of the first ideas I tried to implement. Convolutional Neural Networks are good feature extractors, therefore, they could be used as a first-level model or even as a main classifier. The original images came with different resolution. I rescaled them to $50 \times 50$ . The training of CNN on my laptop was too time-consuming to choose the right architecture in reasonable time, so I declined this idea after several hours of training. I believe, that CNNs could give accurate predictions for this dataset.

Bio

I am Ivan Sosnovik. I am a second-year master student at Skoltech and MIPT. Deep learning and applied mathematics are of great interest to me. You can visit my GitHub to check some stunning projects.

House Prices Advanced Regression Techniques Kaggle Playground Competition Winning Kernels

The House Prices playground competition originally ran on Kaggle from August 2016 to February 2017. During this time, over 2,000 competitors experimented with advanced regression techniques like XGBoost to accurately predict a home’s sale price based on 79 features. In this blog post, we feature authors of kernels recognized for their excellence in data exploration, feature engineering, and more.

In these sets of mini-interviews, you’ll learn:

how writing out your process is an excellent way to learn new algorithms like XGBoost;
if your goal is to learn, sharing your approach and getting feedback may be more motivating than reaching for a top spot on the leaderboard; and
just how easy it is to fall into the trap of overfitting your data plus how to visualize and understand it.

Read on or click the links below to jump to a section.

We’ve also renewed the challenge as a new Getting Started Competition, so we encourage you to fork any of these kernels or try out something completely new to expand your machine learning skill set.

Data Exploration/XGBoost

Fun with Real Estate Data

Created by: Stephanie Kirmer
Language: R

What motivated you to create it?

I had just learned about XGBoost, and was interested in doing a start to finish project comparing xgb to regression and random forest side by side. I though others might also want to be able to see how the procedures compared to each other. I learn a lot by doing, I am a hands-on learner as code is concerned, so this was a great opportunity for me.

Stephanie's kernel includes everything you need from data exploration and cleaning to model building including XGBoost.

Stephanie's kernel Fun with Real Estate Data includes everything you need from data exploration and cleaning to model building including XGBoost.

What did you learn from your analysis?

Oh goodness, I learned a lot. I have since learned even more, and my xgb implementations today are better, I think, but in this I learned about setting up the script so it would be smooth and make sense to the reader, so it would run at a reasonable speed (not that easy, random forests are slow), and I really put some work into the feature engineering and just simply learned a lot about how houses are classified/measured.

Can you tell us about your approach in the competition?

Entering the competition with my results was kind of an afterthought. I started work early on in this competition because it was a dataset that I could actually work with on the kernel structure or on my local machine (most competition datasets on Kaggle are way too big for my hardware/software to manage). I wanted to write the kernel first, and then it was just easy to enter from that interface so I did. I’m pretty proud of the results given it was my first reasonably competent implementation of XGBoost!

Comprehensive Data Exploration with Python

Created by: Pedro Marcelino
Language: Python

What motivated you to create it?

My main motivation was learning. Currently, I am looking to develop my skills as a data scientist and I found out that Kaggle is the best place to do it. According to my experience, any learning process gets easier if you can relate your study subject with something that you already know. In Kaggle you can do that because you can always find a dataset to fall in love with. That is what happened in my case. Having a background in Civil Engineering, the ‘House Prices: Advanced Regression Techniques’ competition was an obvious choice, since predicting house prices was a problem I had already thought about. In that sense, Kaggle works great for me and it is the place to go when I want to learn data science topics.

What did you learn from your analysis?

The most important lesson that I took from my analysis was that documenting your work is a great advantage. There are many reasons why I believe in this. First, writing helps clarify thinking and that is essential in any problem solving task. Second, when everything is well documented, it is easier to use your work for future reference in related projects. Third, if you document your work, you will improve the quality of the feedback you will receive and, consequently, you get more chances to improve your work. In the beginning, it might feel frustratingly slow to document everything, but you will get faster with practice. In the end, you will realize that it can be a funny exercise (and feel compelled to even add some jokes to your text).

The distribution of house prices by year built in Pedro's kernel Comprehensive Data Exploration in Python.

Can you tell us about your approach in the competition?

My approach was to focus on a specific aspect of the data science process and look for a solid bibliographic reference that could guide my work. For this competition, I opted to improve my skills in the aspect of data exploration. As a bibliographic reference, I used the book ‘Multivariate Data Analysis’ (Hair et al., 2014), in particular its Chapter 3 ‘Examining your data’. Since the book is well organized and written in a straightforward way, it is easy to follow it and use it as bridge between theory and practice. This is the approach that I usually follow when I am learning the basics: define the problem that I want to solve, look for related references and adapt them to my needs. Nothing more than ‘standing on the shoulders of giants’.

Pre-Processing and Feature Engineering

A Study on Regression Applied to the Ames Dataset

Created by: Julien Cohen Solal
Language: Python

What motivated you to create it?

Playground competitions are all about learning and sharing. I’m no expert at all, and was even less so when I published this kernel, but most of what I learned about machine learning models, I learned on Kaggle. If I recall correctly, I published it pretty early in the competition. There were already a few really interesting kernels, but I felt some of the work I had done hadn’t been presented anywhere else thus far, so there was my opportunity to share.

Around this period, I also just finished reading a book which I really liked (Python Machine Learning by Sebastian Raschka), and I couldn’t wait to apply some of the things I had read about on a dataset which looked interesting to me. When my first few submissions scored pretty decent results (at that moment at least), I figured my code was probably good enough that a few people could probably learn a thing or two reading it, and I could also maybe get some feedback to improve it.

Julien's kernel A Study on Regression Applied to the Ames Dataset uses every trick in the book to unleash the potential power of linear regression.

What did you learn from your analysis?

Well, strictly speaking about the topic of house prices, it confirmed what was pretty much universally known: it’s all about location and size. Other features like overall quality matter as well, but much, much less.

Now about applying machine learning to real-world problems, this was a real learning experience for me. First of all, feature engineering is fun! It’s definitely my favorite part of the overall process, the creativity aspect of it, especially on a dataset like this one where features aren’t anonymized and you can really focus on trying to improve your dataset with new features that make sense, not just blindly combine features and try random transformations on those.

Also, applying regularization when using Linear Regression is pretty much essential. It penalizes extreme parameter weights and as such allows us to find a much better bias/variance tradeoff and to avoid overfitting.

Can you tell us about your approach in the competition?

Right from the start, I knew I wouldn’t try to aim for the top spots (I wouldn’t be able to anyway!). I’m not really interested in stacking tens or hundreds of finely-tuned models, which seems to be pretty much necessary to win any Kaggle competition these days. I was in it to test some techniques I had heard about, and learn some new ones via the forum or the kernels.

I tried to mix the most interesting ideas I could read about in the kernels that were already published, mix those with my own, and go from there. I was using cross-validation to validate every single preprocessing concept, but it was at times a frustrating process, as the dataset is really small, and it was hard to distinguish the signal from the noise. Some features that made so much sense to me were actually downgrading my score. All in all it was still a great learning experience, and I’m happy I have been able to share some knowledge as well.

Insightful Visualizations

A Clear Example of Overfit

Created by: Osvaldo Zagordi
Language: R

What motivated you to create it?

What I liked about this competition was its small dataset, which allowed me to experiment quickly on my old laptop, and the fact that everybody can easily have a feeling of what a house price is. Your RMSE is zero-point-something, but how does it translate on the prediction? Am I off the price by one thousand, ten thousand, or one hundred thousand? Plotting the predicted vs. the actual prices was then a natural thing to do.

What did you learn from your analysis?

Trivially, I learned that overfitting can hit me much more then I would have expected. The prediction of gradient boosted trees on the training set is extremely good, almost perfect! Of course, I was in a situation of slightly more than 50 predictors and 700 observations. Still, it was remarkable to observe the trees adapt so perfectly to the observations.

Osvaldo delivers in his kernel A Clear Example of Overfitting.

Can you tell us about your approach in the competition?

Competing at high level in Kaggle quickly becomes hard. I was experimenting exactly with a method aimed at avoiding overfitting when I made that observation. In the end I did not write a kernel on this technique (maybe next time), and I did not even invest much energy in trying to climb the leaderboard. But I decided it was worth writing a kernel showing the example of the overfit. I thought it was somehow “educational”. In general, communicating the results is my favourite part of the whole analysis, even more so for this playground competition.

Leaf Classification Playground Competition: Winning Kernels: Read more about this competition that challenged over 1,500 Kagglers to accurately identify 99 different species of plants based on a dataset of leaf images. In a series of mini-interviews, authors of top kernels from the competition share everything from why you shouldn't always jump straight to XGBoost to visually interpreting PCA and k-means.

Dogs versus Cats Redux Kaggle Playground Competition Winners Interview

The Dogs versus Cats Redux: Kernels Edition playground competition revived one of our favorite "for fun" image classification challenges from 2013, Dogs versus Cats. This time Kaggle brought Kernels, the best way to share and learn from code, to the table while competitors tackled the problem with a refreshed arsenal including TensorFlow and a few years of deep learning advancements. In this winner's interview, Kaggler Bojan Tunguz shares his 4th place approach based on deep convolutional neural networks and model blending.

The basics

What was your background prior to entering this challenge?

I am a Theoretical Physicist by training, and have worked in Academia for many years. A few years ago I came across some really cool online machine learning courses, and fell in love with that field. I’ve been doing some freelancing data science and machine learning work for a while, and now I work for a FinTech startup.

Bojan Tunguz on Kaggle.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

When I was growing up my family owned several cats. I also watch a lot of online cat videos, so I feel I have a pretty good idea of what cats look like and how they differ from dogs.

So far I have not had any “official” experience in computer vision outside of Kaggle competitions. However, I have competed pretty successfully in a few other image recognition/categorization competitions, and I count this as one of my core machine learning competencies.

How did you get started competing on Kaggle?

I’ve been hearing about Kaggle for years, but finally decided to take the plunge and start competing about a year and a half ago (September 2015). I was initially apprehensive about competing on such a high level, but Kaggle’s community, kernels, discussions, etc., were very useful and helpful in getting me up to speed.

What made you decide to enter this competition?

There are several things I liked about the Dogs vs. Cats Redux competition that made me want to spend a lot of my time on it. As I already mentioned, I like image categorization competitions, and on average I do pretty well on them. This competition also seemed as “pure” of a machine learning categorization problems as they come: just two, perfectly balanced categories, with enough data to build sophisticated models. As I tried a few early solutions, the problem seemed pretty “blendable,” i.e. blending solutions from different models would generally improve the public leaderboard score. This suggested to me that building advanced “meta” models would be relatively straightforward to do. The competition also started at the time when I didn’t see many other interesting competitions on Kaggle. I also liked the fact that this was a repeat of a competition that was hosted on Kaggle before, so it was interesting to compare the methods and solutions from that competition, and see how far the field of image classification has progressed in just a few years. Finally, since this was a “Playground” competition that ran for about half a year, it gave me ample time and opportunity to try out different strategies and refine my image classification skills without the added pressure of being one of the “Featured” competitions.

Let’s get technical

Did any past research or previous competitions inform your approach?

Image classification problems have by now become almost commoditized, and there are a lot of good papers, tools, and software libraries the help you get started. The Deep Learning community has been generously offering many of their pretrained models for free, and these would be prohibitively expensive and time consuming to train “from scratch”. I have also benefitted from my experience with other previous and current image classification competitions (Yelp Restaurant Photo Classification, State Farm Distracted Driver Detection, Nature Conservancy Fisheries Monitoring, etc. ) which have greatly helped with refining my workflow.

What preprocessing and feature engineering did you do?

I spent relatively little time on preprocessing and feature engineering. I had split data for various cross validation folds on disk, in order to ensure the full consistency across multiple models/machines, as well as for easier access by various command line tools that I used. For one of my models I’ve done a lot of image augmentation - cropping, shearing, rotating, flipping, etc.

What supervised learning methods did you use?

Just like with most other image recognition/classification problems, I have completely relied on Deep Convolutional Neural Networks (DCNN). I have built a simple convolutional neural network (CNN) in Keras from scratch, but for the most part I’ve relied on out-of-the-box models: VGG16, VGG19, Inception V3, Xception, and various flavors of ResNets. My simple CNN managed to get the score in the 0.2x range on the public leaderboard (PL). My best models that I build using features extracted by applying retrained DCNNs got me into the 0.06x range on PL. Stacking of those models got me in the 0.05x range on PL. My single best fine-tuned DCNN got me to 0.042 on PL, and my final ensemble gave me the 0.35 score on PL. My ensembling diagram can be seen below:

Ensembling diagram

Which tools did you use?

I have primarily used Keras and a Facebook implementation of pretrained ResNets. The latter is written in Torch, so as I am not proficient in Lua, I had to develop all sorts of hacks to get the output of command line tools into my main Python scripts. I have also used OpenCV, XGBoost and sklearn for image manipulation and stacking.

How did you spend your time on this competition?

I have not done much feature engineering for this competition. Only one of my cross-validation models used significantly augmented images for its training input. I would say that for this competition I spent about 5% of my time on feature engineering, and the rest on machine learning.

What does your hardware setup look like?

I’ve built my own Ubuntu desktop box specifically for machine learning projects - i7 Intel processor, ASUS motherboard, 32 GB of RAM and dual NVIDIA GTX 970/960 cards. I’ve built and trained most of my models for this competition on that machine. Recently I’ve been able to avail of a System76 laptop with 64 GB of RAM and NVIDIA GTX 1070 GPU, but I have not used it for any of my most advanced models.

What was the run time for both training and prediction of your winning solution?

The most elaborate model that I used for this competition was a 10-fold CV 269-layer deep ResNet. It took about 15 hours to train each fold on my machine, so that translates into about 6 days training. The prediction phase was about 20 minutes per fold, so about three and a half hours total. As I mentioned above, I’ve viewed this competition as a good practice for learning how to fine tune neural networks for image recognition/classification problems, and and over the course of its duration I’ve spent many weeks worth of computational time on various different models.

Words of wisdom

What have you taken away from this competition?

Given enough clean, well-defined image data, deeper the CNN model the better.

Looking back, what would you do differently now?

I would try harder and start earlier to look into training really deep neural networks from scratch. I’ve been able to train a ResNet 50 from scratch, and it outperformed the fine tuned pertained model. However, I have not been able to do the same with the deeper NNs - my training was stuck in a rut. I would also invest more time in the localization of cats and dogs in images, and maybe even train a separate NN for that task. I would also look into getting additional data form other sources, since this seems to be allowed by the competition rules.

Do you have any advice for those just getting started in data science?

Just do it! If you are interested in data science, just start reading available resources, taking online classes, and, of course check out Kaggle competitions and tutorials. Regardless of your previous level of competence in coding and statistics, I believe the best way to get started with data science is just to take the plunge and start working on some projects. Learn R and/or Python, the two most popular languages with data scientists. Look at other people’s code, and then play with it and modify it to see what happens. Don’t be intimidated by the complex-sounding terms and algorithms.

I have taken several different online courses, and I would recommend the ones offered through Coursera and Udacity. Check out the Kaggle tutorial competitions: Digit Recognizer, Titanic and House Prices. They provide a lot of useful kernels that you can play with and modify. Go through Kaggle discussion boards - they too have tons of useful information. Don’t hesitate to ask questions - we’ve all been “noobs” at some point, and it was thanks in no small part to those who were patient enough to explain some “simple” concepts to us that we finally got where we are now.

Bio

Bojan Tunguz works for ZestFinance as a Machine Learning Modeler. He has been involved in data science and machine learning for about 3 years. He holds BS and MS degrees in Physics and Applied Physics from Stanford University, and a Ph.D. in Physics from University of Illinois at Urbana-Champaign. He currently doesn’t own any dogs or cats, but hopes that this state of affairs will not long endure.

Cats versus Dogs Kaggle Kernels Redux Playground Competition Winner's Interview Marco Lugo

The second iteration of the Dogs vs. Cats Redux playground competition challenged Kagglers to once again distinguish images of dogs from cats. This time relying on advances in computer vision and new tools like Keras. In this winner's interview, Kaggler Marco Lugo shares how he landed in 3rd place out of 1,314 teams using deep convolutional neural networks: a now classic approach. One of Marco's biggest takeaways from this for-fun competition was an improved processing pipeline for faster prototyping which he can now apply in similar image-based challenges.

The basics

What was your background prior to entering this challenge?

I am an economist by training and have been submerged in econometrics, which I would describe as the more classical branch of statistics where the main economic focus is often in policy and therefore on causality.

Marco Lugo on Kaggle.

I started programming with the C language about two decades ago and have always strived to keep learning about programming, eventually landing on R which made me discover machine learning in 2013 - I was instantly hooked on predictive modeling.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

I have tried various computer vision datasets in the past but nothing that had forced me to push the envelope on hyperparameter optimization. This was my first image-related competition.

How did you get started competing on Kaggle?

I believe it was one night where I was searching how to do something with R and as it turns out, the code that ended up helping me to understand how it was done was found on the Kaggle website. I explored the site at that time and decided to enter a competition for fun, applying a linear regression thinking that it would be easy but ended up obtaining a less-than-stellar outcome instead. It was that somewhat humbling result that pushed me into machine learning.

What made you decide to enter this competition?

I was taking the excellent deep learning course by Jeremy Howard, Kaggle’s ex-president and ex-Chief Scientist, and one of the homework assignments was to enter the competition and get a top 50% ranking. I did my homework.

Let’s get technical

Did any past research or previous competitions inform your approach?

The online notes for Stanford’s CS231n course by Andrej Karpathy were particularly useful. Also, the Kaggle Blog winner's interviews were good to spark new ideas when my score started to stall.

What preprocessing and feature engineering did you do?

I randomly partitioned the data to create a validation set containing only 8% of the training set. I also demeaned and normalized the data as needed and used data augmentation to varying degrees.

What supervised learning methods did you use?

I used deep convolutional neural networks, both trained from scratch and pre-trained on the ImageNet database. My ensemble was a weighted average of the following models:

1 VGG16 pre-trained on ImageNet and fine-tuned.
2 ResNet50s pre-trained on ImageNet and fine-tuned.
1 ResNet50 trained from scratch.
3 Xception models pre-trained on ImageNet and fine-tuned.
Features extracted from pre-trained InceptionV3, Resnet50, VGG16, Xception, used as an input to (1) Microsoft’s implementation of gradient boosting, lightGBM and (2) a 5 layer neural network.
2 VGG-inspired convolutional neural networks trained from scratch.

Diagram of Marco's ensemble model.

Were you surprised by any of your findings?

I was pleasantly surprised by the effect of adding relatively poor performing models into the mix. It was also interesting to play around with the different variations of rectified linear units (ReLU) as switching from standard ReLU to Leaky ReLU and Randomized Leaky ReLU had a noticeable impact.

Which tools did you use?

I used Keras, developed by François Chollet from Google, for all of the neural networks with both the Theano and TensorFlow back-ends depending on the type of model I had to run. While the vast majority of the work was done in Python, I did use R to run the lightGBM model.

What does your hardware setup look like?

It’s a relatively old Windows 7 machine running on a AMD FX-8350 CPU, with 24GB of RAM and a NVidia GTX1060 6GB GPU. I also run an Ubuntu virtual machine on it but most of the work is done on Windows. I plan on upgrading soon.

What was the run time for both training and prediction of your winning solution?

I remember that some models took over 74 hours to train as I often trained for hundreds of epochs but I cannot put an exact number on all the iterations and models that I had to run, I would estimate it at 3 or 4 weeks of running time. Predicting for the full test set took under an hour.

Words of wisdom

What have you taken away from this competition?

I learned how important it is to properly understand the evaluation function. It was worth my time to sit down with pen and paper to explore the mathematical properties of the logarithmic loss function. Understanding the formula is not the same as understanding its impact.

Looking back, what would you do differently now?

I would have set my processing pipeline much earlier in the competition. I only did it after cracking the top 40% and, unsurprisingly, it enabled faster prototyping and thus allowed me to start to make real gains on a daily basis. It is also worth the investment as it can be easily reused. I was able to quickly recycle it for the Cervical Cancer Screening competition and land a top 10% position from the start building on the same setup.

Do you have any advice for those just getting started in data science?

I would highly recommend trying out as many different problems as you can and getting your hands dirty even if you do not fully grasp the theory behind it at the beginning. I will steal a page here from Jeremy Howard’s deep learning course and refer you to a short essay that perfectly illustrates this point: A Mathematician’s Lament by Paul Lockhart.

Bio

Marco Lugo currently works as a Senior Analyst at Canada Mortgage and Housing Corporation. He holds a B.Sc. in Economics and Philosophy and a M.Sc. in Economics from the University of Montreal.

More on No Free Hunch

Want to see how others have tackled the Dogs versus Cats playground competition? Check out Kaggler Bojan Tungu'z winner's interview.

Dstl Satellite Imagery Kaggle Competition Winners Interview Kyle Lee

Dstl's Satellite Imagery competition, which ran on Kaggle from December 2016 to March 2017, challenged Kagglers to identify and label significant features like waterways, buildings, and vehicles from multi-spectral overhead imagery. In this interview, first place winner Kyle Lee gives a detailed overview of his approach in this image segmentation competition. Patience and persistence were key as he developed unique processing techniques, sampling strategies, and UNET architectures for the different classes.

The Basics

What was your background prior to entering this challenge?

During the day, I design high-speed circuits at a semiconductor startup - e.g. clock-data recovery, locked loops, high-speed I/O, etc. - and develop ASIC/silicon/test automation flows.

Kyle on Kaggle.

Even though I don’t have direct deep learning research or work experience, the main area of my work that has really helped me in these machine/deep learning competitions is planning and building (coding) lots and lots of design automation flows very quickly.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

The key competition that introduced me to the tools and techniques needed to win was Kaggle’s “Ultrasound Nerve Segmentation” that ended in August 2016 (and I saw many familiar names from that competition in this one too!).

Knowledge accumulated from vision/deep learning related home projects and other statistical learning competitions has also helped me in this effort. Patience picked up from running and tweaking long circuit simulations at work over days/weeks were transferable and analogous to neural network training too.

Like many of the competitors, I didn’t have direct experience with multi-spectral satellite imagery.

How did you get started competing on Kaggle?

I joined Kaggle after first trying to improve my 3-layer shallow networks on Lasagne for single-board computer (SBCs, e.g. Raspberry Pi) stand-alone inferencing/classification systems for various home/car vision hobbyist projects, and wanted a more state of the art solution. Then I came across Kaggle’s State Farm Distracted Driver contest, which was a perfect fit. This was after completing various online machine learning courses - Andrew Ng’s Machine Learning course, Geoffrey Hinton’s course on Neural Networks, to name a few.

This was early 2016 - and it’s been quite a journey since then!

What made you decide to enter this competition?

As I mentioned earlier I participated in one of the earliest segmentation challenges on Kaggle - the “Ultrasonic Nerve Segmentation”. In that competition, I was ranked 8th on the public leaderboard but ended up as a 12th on private LB - a cursed “top silver” position (not something any hard worker should get!). Immediately after that I was looking forward to the next image segmentation challenge, and this was the perfect opportunity.

More importantly, I joined to learn what neural/segmentation networks have to offer apart from medical imaging, and to have fun! Over the course of the competition, I definitely achieved this goal since this competition was extra fun - viewing pictures of natural scenery is therapeutic and kept me motivated everyday to improve my methodology.

Let’s get technical

What was your general strategy?

In summary my solution is based on the following:

Multi-scaled patch / sliding window generation (256x256 & 288x288 primary, 224x224, 320x320 added for ensembling), and at edges the windows overlapped to cover the entire image.
U-NET training & ensembling with a variety of models that permuted bands and scales
Oversampling on rare classes - oversampling was performed by sliding in smaller steps over positive frames and sliding in larger steps over negative frames than default window size.
Index methods for waterways - namely a combination of Non-Differential Water Index (NDWI) and Canopy Chlorophyl Content Index (CCCI)
Post-processing on roads, standing water versus waterways, and small versus large vehicles. This post-processing resolved class confusion between standing water and waterways, cleaned up artifacts on the roads, and gave some additional points to the large vehicle score.
Vehicles - I did some special work here to train and predict only on frames with roads and buildings. I also only used RGB bands, a lot of averaging, and used merged networks (large+small) for large vehicle segmentation.
Crops - The image was first scaled to 1024x1024 (lowered resolution), then split into 256x256 overlapping sliding windows.

A VISUAL OVERVIEW OF THE COMPLETE SOLUTION (ALL CLASSES)

What preprocessing and feature engineering did you do?

I performed registration of A and M images, and used sliding window at various scales. In addition, I also oversampled some of the rare classes in some of the ensemble models. The sliding window steps are shown below:

PATCH DISTANCE FOR OVERSAMPLED CLASSES

Oversampling standing water and waterway together was a good idea since it helped to reduce the amount of class confusion between the two, with reduced artifacts (particularly for standing water predictions).

As far as band usage is concerned, I mostly used panchromatic RGB + M-band and some of the SWIR (A) bands. For the A-bands I mostly did not use all the bands, but randomly skipped a few bands to save training time and RAM.

As mentioned earlier, for vehicles I trained and predicted only on patches/windows with roads and/or buildings - this helped to cut down the amount of images needed for training, and allowed for significant oversampling of vehicle patches. This scheme was applied also on test images, so results are pipelined as you can see from the flowchart.

Finally, preprocessing involved the use of mean/standard deviation normalization using the training set - in other words, each training/validation/test patch was subtracted by the mean and divided by the standard deviation of the training set only.

What supervised learning methods did you use?

The UNET segmentation network from the “Ultrasonic Nerve Segmentation” competitions and other past segmentation competitions was widely used in my approach, since it is the most easily scalable/sizeable fully convolutional network (FCN) architecture for this purpose. In fact, if I am not mistaken, most - if not all - of the top competitors used some variant of the UNET.

I made tweaks to the original architecture with batch-normalization on the downstream paths + dropout on the post-merge paths, and all activation layers switched to Exponential Linear Unit (ELU). Various widths (256x256, 288x288, etc.) and depths were used depending on the various classes via cross-validation scores.

For example, in my experiments, the structure class converged best - both in terms of train time and CV - with a UNET that had a wider width (288x288) and a shallow depth (3 groups of 2x conv layers + maxpool).

VARIOUS UNET ARCHITECTURES FOR DIFFERENT CLASSES

Overall, I generated 40+ models of various scales/widths/depths, training data subsamples, and band selections.

FULL MODEL (WIDTH/DEPTH, SAMPLING, BANDS) LISTING OF ALL CLASSES

In terms of cross validation, I used a random patch split of 10-20% across images (depending on class, the rarer the larger). For oversampled classes only 5% random patch were used. Only one fold per model was used to cut down on runtime in all cases.

Training set was train-time augmented (both image+mask) with rotations at 45 degrees, 15-25% zooms/translations, shears, channel shift range (some models only), and vertical+horizontal flips. No augmentation with ensembling was performed on validation or test data.

Optimization wise I used the Jaccard loss directly with Adam as optimizer (I did not get much improvement from NAdam). I also had a learning rate policy step which dropped the learning rate at around 0.2 of the initial rate for every 30 epochs.

Ensembling involved the use of mask arithmetic averaging (most classes), unions (only on standing water and large vehicles), intersections (only on waterways using NDWI and CCCI).

What was your most important insight into the data?

My understanding is that most competitors had either weak public or private scores with standing water and vehicles, which I spent extra effort to deal with in terms of pre- and post-processing. I believe stabilizing these two (actually three) classes - standing water, large and small vehicles made a large impact on my final score relative to other top competitors.

Standing Water Versus Waterways

For standing water, one of the main issues with standing water was class confusion with waterways. As described earlier, oversampling both standing water and waterways helps to dissolve waterway artifacts in standing water UNET predictions, but there was still a lot of waterway-like remnants, as shown below in raw ensembled standing water predictions:

EXAMPLES OF MISCLASSIFIED POLYGONS IN STANDING WATER

The key to resolving this was to realize that from a common sense perspective - waterways always touch the boundary of the image, while standing water mostly does not (or has a small overlap area / dimension only). Moreover, the NDWI mask (generated as part of waterways) could be overlapped with the raw standing water predictions, and very close broken segments could be merged (convexHull) to form a complete contour that may touch the boundary of the image. In short, boundary contact checking for merged water polygons was part of my post-processing flow which pushed some misclassified standing water images into the waterway class.

Vehicles - Large and Small

The other important classes which I spent a chunk of time on were the two vehicle classes. Firstly, I noticed - both on the training data and just simply common sense - is that vehicles are almost always located on or near roads, and near buildings.

EXAMPLES OF SMALL VEHICLES RELATIVE TO ROADS AND BUILDINGS

By restricting training and prediction to only patches containing buildings and roads, I was naturally able to allow for oversampling of vehicle patches, and narrow down the scope of scenery for the network to focus on. Moreover, I chose only RGB images, since in all other bands vehicles were either not visible, or displaced significantly).

Secondly, many vehicles were very hard to distinguish between large and small classes both in terms of visibility (blurred) and mask areas. For reference, their mask areas from training data are shown in the histogram below, and there is a large area overlap between large and small vehicles from around 50-150 pixels^2.

CONTOUR/MASK AREA HISTOGRAM OF SMALL VS LARGE VEHICLES

To deal with this, I trained additional networks merging both small+large vehicles, and took the union of this network with large vehicle only network ensemble. The idea is that networks that merge both small+large are able to predict better polygons (since there is no class confusion). I then performed area filtering of this union (nominally at 200pixel^2) to extract large vehicles only. For small vehicles, it was basically just to take the average ensemble of small vehicle predictions, and remove whichever contours overlapped with large vehicles and/or over the area threshold. Additionally, both vehicle masks were cleaned by negating their masks with buildings, trees, and other classes.

Post-competition analysis showed that this approach helped large vehicle private LB score - which if I did not, would have dropped by -59%. On the other hand small vehicles did not have any improvement from the area threshold removal process above.

Were you surprised by any of your findings?

Surprisingly, waterways could well be generated using simple and fast index methods. I ended up with a intersection of NDWI and CCCI masks (with boundary contact checking to filter out standing water / building artifacts) rather than using deep learning approaches, thus freeing up training resources for other classes. The public and private LB score for this class seemed competitive relative to other teams who may have used deep learning methods.

Finally, here is my CV-public-private split per class.

FINAL LOCAL CV, PUBLIC LB, and PRIVATE LB PER CLASS COMPARISONS

The asterisk (*) for private LB score on crops indicate a bug with OpenCV’s findContours, that if I had used the correct WKT generating script for that class I would have had a crop private LB score of 0.8344 instead of 0.7089. As a result this solution could have achieved an overall private LB score of 0.50434 (over 0.5 - yay!) rather than 0.49272.

The bug had to do with masks spanning the entire image not being detected as a contour - I had only found this out after the competition and would have done a WKT mask dump ‘diff’ if I had the time. All other classes were using the correct shapely versions of the submission script.

My guess is that my vehicle and standing water scores (combined) were the ones that made a difference in this competition, since the other top competitors had either weak vehicle scores or weak standing water scores.

Which tools did you use?

Keras with Theano backend + OpenCV / Rasterio / Shapely for polygon manipulation.

No pretrained models were used in the final solution, although I did give fine-tuned (VGG16) classifier-coupling for merged vehicle networks a shot - to no avail.

How did you spend your time on this competition?

Since this was a neural network segmentation competition, most of time (80%+) was spent on tuning and training the different networks and monitoring the runs. The remaining (20%) was on developing the post and pre-processing flows. From a per class effort perspective, I spent over 70% of the overall time on vehicles, standing water, and structures, and I spent the least time on crops.

In terms of submissions, I used a majority of the submissions trying to fine tune polygon approximation. I first tried bounding boxes, then polygon approximation, and then polygon with erosion in OpenCV. Ultimately, I ended up using rasterio/shapely to perform polygon to WKT conversion. All classes (except trees) had no approximation, while trees were first resized to 1550x1550 - effectively approximating the polygons - before being converted to WKT format.

What does your hardware setup look like?

I used three desktops for this contest. The first two were used for all the training/inferencing of all classes, while the last one (#3) was only run on crops.

GTX1080 (8GB) + 48GB desktop system RAM
GTX1070 (8GB) + 48GB desktop system RAM
GTX960 (4GB) + 16GB desktop system RAM.

What was the run time for both training and prediction of your winning solution?

It took about three days to train and predict - assuming all models and all preprocessing scales can be run in parallel. One day for preprocessing, one day to train and predict, and another day to predict vehicles and generate submission.

Thank You

Once again, thank you to Dstl and Kaggle for hosting and organizing this terrific image segmentation competition - I believe this is by far the most exciting (and busy, due to the number of classes) competition I have had, and I am sure this is true for many others too.

It’s always interesting to see what neural networks can accomplish with segmentation - first medical imaging, now multi-spectral satellite imagery! I personally hope to see more of these type of competitions in the future.

Words of wisdom

What have you taken away from this competition?

A lot of experience training neural networks - particularly segmentation networks, working with multi-spectral images, and improving on traditional computer vision processing skills. Some of the solution sharing by the top competitors were absolutely fascinating as well - especially clever tricks with multi-scale imagery in a single network.

Looking back, what would you do differently now?

I would have added some ensembling to crops, added heat-map based averaging (and increase the test overlap windows at some expense of runtime), dilated structures training mask (which helped structure scoring for some competitors), and removed most of the expensive rare scale (320x320, for example) ensembling on tracks.

I would also have fixed the contour submission issue on crops had I caught that earlier.

Do you have any advice for those just getting started in data science?

Nothing beats learning by practice and competition, so just dive in a Kaggle competition that appeals to you - whether it be numbers, words, images, videos, audio, satellite imagery, etc. (and that you can commit to early on if you want to do well).
Moreover, data science is an ever evolving field. In fact, this field wasn’t even on the radar a decade ago - so be sure to keep to date on the architectural improvements year-by-year. Don’t worry, most other competitors are starting on the same ground as you, especially with some of the new developments.
Having more systems helps in terms of creating experiments and ensemble permutations, but it’s not absolutely necessary if you have a strong flow or network.
However, for this particular competition, having >= 2 GPU systems will definitely help due to the sheer number of classes and models involved.
Most importantly, have fun during the competitions - it won’t even feel like work when you are having fun (!) Having said that, I am still a beginner in many areas in data science - and still learning, of course.

Bio

Kyle Lee works as a circuit and ASIC designer during the day. He has been involved in data science and deep learning competitions since early 2016 out of his personal interest for automation and machine learning. He holds a Bachelor’s degree in Electrical and Computer Engineering from Cornell University.

March Machine Learning Mania 2017, 2nd Place Winner's Interview: Scott Kellert

Kaggle's annual March Machine Learning Mania competition returned once again to challenge Kagglers to predict the outcomes of the 2017 NCAA Men's Basketball tournament. This year, 442 teams competed to forecast outcomes of all possible match-ups. In this winner's interview, Kaggler Scott Kellert describes how he came in second place by calculating team quality statistics to account for opponent strength for each game. Ultimately, he discovered his final linear regression model beat out a more complex neural network ensemble.

The basics

What was your background prior to entering this challenge?

I work as a data scientist at Nielsen doing Marketing ROI Attribution for digital ads. I got my Bachelors in Industrial Engineering from Northwestern University and my Masters in Analytics at the University of San Francisco.

Scott Kellert on Kaggle.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

I have no specific training or work experience in the field of sports analytics. However, as a die hard Oakland A’s fan, Moneyball is a near religious text for me. I have always tried to read up on the most cutting edge sports analytics trends. I have done side projects in the past to prep for fantasy football and baseball leagues as well as analysis on the existence of clutch hitters in baseball (there aren’t) and ongoing work on an all encompassing player value metric for the NHL.

How did you get started competing on Kaggle?

I started on Kaggle with the 2015 edition of the March Machine Learning Madness competition. I entered with two of my peers from grad school for our Machine Learning final project. While I have dipped my toes in a few other competitions, the March Madness competitions are the only ones I have pursued seriously and I have done so each year since.

Let’s get technical

Before I get to answering these questions, I would like to direct people to the website that I built using the results from this competition. While this competition was geared towards producing a probability for every possible match up, I built the website to be used as a guide for filling out a bracket. It also contains many of the outputs from my analysis in an easy to digest format. I hope you enjoy. (NB I am not a web developer. This site does go down occasionally and there are small bugs. I am always trying to improve though so feel free to reach out if you have some feedback.)

www.pascalstriangleoffense.com

Did any past research or previous competitions inform your approach?

With this being the third year in a row that I have participated in this competition, I was able to reuse a great deal of the work from prior years. However, I make a point to tweak and improve my model in significant ways each year.

What preprocessing and feature engineering did you do?

Preprocessing and feature engineering is the most important part of my process for this competition, and I believe it probably is for many competitors as well. Unlike some other Kaggle competitions, the training data does not come in a format that allows for any algorithms to be applied without preprocessing. Each row is a game box score which is information you will not have yet at the time of prediction, and it contains no information about the teams’ prior performance.

While there are many services that will provide analytically driven statistics on team quality (most notably KenPom), I set a goal to perform all the calculations myself. In college basketball the concept of adjusting team statistics for opponent strength is crucial. Teams play most of their games within their own conference and these conferences vary wildly in skill. Therefore, a team can produce inflated stats in a bad conference or deflated stats in a great conference. Adjusting these statistics for opponent strength will make a big impact on the quality of the predictions.

For example, we could describe a team’s offense by taking its average points scored across the season. Applying this to the 2015-2016 season, the top 5 offensive teams in the country would be Oakland, The Citadel, Marshall, North Florida, and Omaha. None of these teams even made the tournament. Applying the opponent adjustment algorithm reveals that North Carolina was, in fact, the best offense in the country. The Tar Heels made it all the way to the final.

The algorithm for applying the adjustment itself is relatively simple but is computationally expensive. The idea is that for every statistic we want to adjust we’ll give each team a relative score, meaning it has no units or direct interpretability. Every team starts with a score of 0 which reflects our lack of knowledge about the system before the optimization begins. The next step is to generate a score for every team and every game. This score is a reflection of how well that team did with respect to the given statistic in the given game. For point differential, which is the metric I use for overall team quality, the score is calculated using the pythagorean expectation for the game.

The rest of the statistics are absolute, meaning that unlike point differential there is no against portion of the formula. For example a team’s ability to produce blocks, ignoring the blocks they allow. In this case I produce a score using the p value produced from the normal distribution.

After game scores are calculated for every game, a team’s overall statistic score (which started at 0) is updated to be the average of the sum of their game score and their opponent’s overall score across all the games in the season. In the case of the absolute stats, the opponent’s score is their score for preventing that stat, not producing. Then the whole process is repeated until the scores converge. These scores typically end up being distributed between -1 and 1. The interpretation is that a team with a score of 0 would be expected to tie an average team in the case of point differential or produce an average amount of a given statistic against a team that is average in preventing it. As the score diverges from 0 there is not as simple an interpretation but it can be taken to mean absolute ability.

One additional wrinkle that I added this year was to create an opponent adjusted score for each game. In the past I created one score for the whole season, but this introduced data leakage in the training set. A team’s score was impacted by every game from the regular season but also used to predict outcomes from games in that season. By calculating these scores by game, I had a training set where I could predict regular season outcomes based on statistics from all the games from that season except for the game being predicted. This created a minimal but significant improvement to my results.

What supervised learning methods did you use?

Going into the competition I was committed to using Neural Networks as my driving algorithm. In the past I have tried many algorithms and ensembling techniques but Logistic Regression has always won out (I find this to be true a surprising amount of the time across all data science projects). After testing many parameters, I found that my Neural Net worked best with a single relatively small hidden layer and that it had high variance. This realization sparked the idea that I should be bagging my Neural Nets, and sure enough this finally allowed me to surpass Logistic Regression.

However, a later discovery that I will cover below caused Linear Regression to easily beat my excessively complex Neural Net Ensemble. Linear Regression was my final model.

What was your most important insight into the data?

As teased above, my most important insight was to predict continuous point spreads instead of binary wins/losses. I always knew that this was the better approach, but I hadn’t thought of a good way to convert those results into probabilities as the competition requires. In previous years I had tried to use regressors to predict the pythagorean expectations, described above, but these never performed as well as classifier solutions. This year it occurred to me that I could use the concept of the prediction interval from Linear Regression to produce probabilities. I simply calculated the standard error of my point spread predictions (typically around 10.55) and used the normal CDF to produce a probability. This approach performed significantly better than my classifier.

Which tools did you use?

I do almost all my work in ipython Jupyter Notebooks. I find the notebooks to be the easiest way to quickly iterate on both code as well as algorithms. Sklearn, numpy, scipy, and pandas are my drivers within python. Sklearn drives all my ML algorithms, scipy covers my statistical distributions, and numpy/pandas cover all my data engineering.

I will occasionally export data to R if I want to visualize something. This is mostly because I have much more experience with ggplot than with matplotlib.

How did you spend your time on this competition?

Over the course of the three years that I have worked on this competition, the feature engineering task of creating the opponent adjustment algorithm was the most work intensive piece. However, over the last two years, I have only had to tweak that code which gives me a lot more time to work on the machine learning component. I spent significant time this year improving my cross validation approach for testing new estimators and playing with Neural Nets that I did not end up using.

What was the run time for both training and prediction of your winning solution?

Most of the time is committed to running the opponent adjustment optimization. For all 15 seasons of data to be adjusted takes around two hours. Once that process is complete, training and applying the Linear Regression takes less than a minute.

Words of wisdom:

What have you taken away from this competition?

Small tweaks can make a big improvement in the leaderboard. My approach did not change all that much between this year and last year, but the small tweaks that I discussed above took me from finishing around the 60th percentile in 2015 and 2016 to second place this year. If you didn’t do well this year, you could be one minor adjustment away from placing next year!

Do you have any advice for those just getting started in data science?

Learn how to write production worthy code. It doesn’t really matter how good your algorithm is if no one can put it to use.
Interpretable results are frequently more important than the most accurate results. Logistic and Linear Regression may be old and boring but they are still frequently the most accurate and much more interpretable than other estimators.

Bio

Scott Kellert is a data scientist at Nielsen doing Marketing ROI Attribution for digital ads. He has his Bachelors in Industrial Engineering from Northwestern University and my Masters in Analytics at the University of San Francisco.

March Machine Learning Mania Kaggle Competition Winner's Interview Erik Forseth

The annual March Machine Learning Mania competition, which ran on Kaggle from February to April, challenged Kagglers to predict the outcome of the 2017 NCAA men's basketball tournament. Unlike your typical bracket, competitors relied on historical data to call the winners of all possible team match-ups. In this winner's interview, Kaggler Erik Forseth explains how he came in fourth place using a combination of logistic regression, neural networks, and a little luck.

The basics

What was your background prior to entering this challenge?

My background is in theoretical physics. For my PhD I worked on understanding the orbital dynamics of gravitational wave sources. While that work involved a healthy balance of computer programming and applied math, there wasn’t really any statistical component to it. In my spare time, I got interested in machine learning about two years prior to finishing my degree.

Erik Forseth on Kaggle.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

As a matter of fact, sports prediction – college basketball prediction in particular – has been a hobby of mine for several years now. I have a few models which are indefinite works in progress, and which I run throughout the season to predict the outcomes of games. So on one hand, entering the competition was a no-brainer. That said, March Madness is a bit of a different beast, being a small sample of games played on neutral courts by nervous kids.

Let’s get technical:

My 4th-place entry used the combined predictions of two distinct models, which I’ll describe in turn.

Model 1

Needless to say, there’s a rich history and a large body of work on the subject of rating sports teams. Common to most good ratings is some notion of “strength of schedule;” whether implicitly or explicitly, the rating ought to adjust for the quality of opponents that each team has faced.

Consider for example the Massey approach. Each team is assigned a rating r, where the difference between teams' ratings purports to give the observed point differentials (margins of victory) m in a contest between the two teams. So, one constructs a system of equations of the form:

$r_i - r_j = m_{ij}$

for teams and and outcomes $m_{ij}$ , and then solves for the ratings via least squares. The Massey ratings are a compact way of encoding some of the structure of the network of the roughly 350 Division I teams. They take into account who has beaten who, and by how much, for all games played by all teams.

Shown above is a representation of the network of DI NCAA basketball teams. Nodes on the perimeter represent single teams, and node size is proportional to the team’s Massey rating. Edges are drawn only for the final games played by each team.

And so, my first model was a straightforward logistic regression, taking a set of modified Massey ratings as input. My modified ratings differ from the original versions in that: (1) I augment the system of equations to include games played during the prior season, and (2) I rescale the system in such a way that recent games are given greater weight. These are all relatively straightforward computations done in Python using NumPy.

Model 2

The second model is a neural network trained on raw data instead of derived team strength ratings. I'm building these in Python with Theano. As far as I'm aware, most everyone who does sports prediction uses linear models based on team ratings of various flavors (Massey, ELO, RPI, etc., see above), and there hasn’t really been a compelling reason to do anything fancier than that. So, I've been interested in the question of whether or not I can get something as good or better using a different approach entirely. One of the main challenges for me here has been to figure out how to present the model with raw data in such a way that it can build useful features, while at the same time keeping memory under control (I'm confined to training these on my laptop for the time being). This is still very much a work in progress, as I'm continually playing with the input data, the architecture, etc. Nevertheless, prior to the competition I managed to get something which performed as well as my latest-greatest linear models, and so in the end I averaged the predictions from the two.

Finally, my 4th-place entry involved a bit of “gambling.” As I pointed out earlier, 63 games is a really small sample, and to make matters worse, you’re being scored on binary outcomes. You could get a little more resolution on the entries if Kaggle instead posed a regression problem, where competitors might be asked to predict point differentials and were then scored according to mean-squared-error. Or, you could even have competitors predict point differentials and point totals, equivalent to predicting the individual scores of each team in each matchup.

Regardless, the current formulation of the contest is interesting, because it requires a certain amount of strategy that it might not otherwise have if the only goal were to come up with the most accurate classifier on an arbitrarily large number of games. In this case, my strategy was:

There are only 63 games here….
I’m rewarded for being correct, but punished for being wrong.
Nevertheless, I believe in my underlying models, so I’m going to “bet” that their predictions tend to be right.
Therefore, I will take all of the predictions, and push those above 0.5 toward 1, and those below 0.5 toward 0. (I came up with a hand-wavy rule of thumb for pushing those near the extremes more than I pushed those near the middle. In other words, I wasn’t so sure about the predictions near 0.5, so I wanted to more or less leave those alone, whereas I wanted to get the most out of my stronger picks.)

What was the final effect of perturbing my predictions this way? I submitted the unperturbed predictions as my second entry, and it would’ve placed about 25th on the leaderboard, or still close to top 5%. I think this all goes to show that pure luck and randomness play a big role in this competition, but that there is headway to be made with a good model and a sound strategy.

Words of wisdom:

Do you have any advice for those just getting started in data science?

My advice to anyone with an interest in data science is to give yourself a project you’re interested it. Rather than setting out and trying to learn about a specific method, pose an interesting problem to yourself and figure out how to solve it. Or, if you’re absolutely intent on learning more about some particular toolbox, at least give yourself some interesting context within which you can apply those tools. To that end, writing yourself a web scraper can vastly increase your ability to get usable data to play around with.

Just for fun:

If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?

I don’t have a specific problem I’d pose; it’s neat to see the various challenges that crop up. Though, we’ve seen a lot of image recognition tasks recently. I think it would be interesting to have more time series prediction and sequence classification problems.

Dstl Satellite Imagery Kaggle Competition, 3rd Place Winners' Interview: Vladimir & Sergey

In their satellite imagery competition, the Defence Science and Technology Laboratory (Dstl) challenged Kagglers to apply novel techniques to "train an eye in the sky". From December 2016 to March 2017, 419 teams competed in this image segmentation challenge to detect and label 10 classes of objects including waterways, vehicles, and buildings. In this winners' interview, Vladimir and Sergey provide detailed insight into their 3rd place solution.

The basics

What was your background prior to entering this challenge?

My name is Vladimir Iglovikov, I work as a Sr. Data Scientist at TrueAccord. A couple years ago I got my PhD in theoretical physics at UC Davis, but choose not to go for postdoctoral position as most of my colleagues. I love science, I love doing research, but I needed something new, I needed a challenge. Another thing is that I wanted to get deeper experience with software engineering and data science which would aid me in my research. A few months before my graduation, in one of the online classes at Coursera lecturer mentioned Kaggle as a platform to practice your machine learning skills. First few competitions were epically failed, but I learned a lot. Slowly, piece by piece, I was able to merge theoretical knowledge from courses, books and papers with actual data cleaning and model training. During the first year I figured out how and when to apply classical methods like Logistic Regression, SVM, kMeans, Random Forest, xgboost, etc. Neural Network type competitions were rather rare and I did not have a chance to practice deep learning that much, but starting from December of the last year we already had seven computer vision problems. DSTL problem had the closest deadline and that is why I decided to join.

Vladimir on Kaggle.

Sergey Mushinskiy
I had 10+ years experience in IT, until recent switch to machine learning and software development.

I was in search for a great project to showcase my machine learning skills (especially deep learning part of it). And what could be better than winning a tough competition like this? There were several competitions on Kaggle at that time and it was a difficult choice to make – everything looked very interesting.

Sergey on Kaggle.

DSTL competition had the closest deadline, there were some technical challenges, and until the very late, notably less people than usual participated in the competition.. It was interesting to take this challenge that scared off even the top kagglers. However, during this competition it became obvious that there were a lot of extremely talented participants who did a good job and shared endless insights and solved a lot of challenging aspects.

But all this pale in comparison with encouragement coming from friends in our Russian Open Data Science community. It is a group of like-minded and extremely dedicated people, who decided to go all-in into this competition and who gave me support and motivation to start and persevere through all the challenges.

Let’s get technical

We tried to tackle this problem in a variety of different ways and, as expected, most of them didn’t work. We went through a ton of a literature and implemented a set of different network architectures before we eventually settled on a small set of relatively simple yet powerful ideas which we are going to describe.

On the high level we had to solve an image segmentation problem. Ways to approach it are well known and there is number of papers regarding the topic. Also Vladimir already had experience with this kind of tasks when he participated in an Ultrasound Nerve Segmentation competition and placed 10th out of 923.

What was the data? Unlike the other computer vision problems where you are given RGB or grayscale images, we had to deal with the satellite data that is given both in visual and lower frequency regions. On one hand it carries more information, on another it is not really obvious how to use this extra data properly.

Data was divided into train (25 images) and test (32 images) sets.

Each image covers 1 square kilometer of the earth surface. The participants were provided with three types of images of the same area: a high-resolution panchromatic (P), an 8-band image with a lower resolution (M-band), and a longwave (A-band) that has the lowest resolution of all. As you can see from the image above, RGB and M-band partially overlap in the optical spectral range. Turns out that RGB was itself reconstructed at the postprocessing step from a combination of a low resolution M-band image and high resolution panchromatic image.

RGB + P (450-690 nm), 0.31 m / pixel, color depth 11bit;
M band (400-1040 nm), 1.24 m / pixel, color depth 11 bit;
A band (1195-2365 nm), 7.5 m / pixel, color depth 14 bit.

Another interesting fact is that our images have color depth of 11 and 14-bit instead of a more common 8-bit. From a neural network perspective it is better, each pixel carries more information but for a human it introduces additional steps required for visualization.

As you can see input data contains a lot of interesting information. But what is our output? We want to assign one or more class labels to the each pixel of the input image.

Buildings: large buildings, residential, non-residential, fuel storage facilities, fortified building;
Misc. man-made structures;
Roads;
Track - poor/dirt/cart tracks, footpaths/trails;
Trees - woodland, hedgerows, groups of trees, stand-alone trees;
Crops - contour ploughing/cropland, grain (wheat) crops, row (potatoes, turnips) crops;
Waterways;
Standing water;
Vehicle (Large) - large vehicle (e.g. lorry, truck,bus), logistics vehicle;
Vehicle (Small) - small vehicle (car, van), motorbike.

The prediction for each class was evaluated independently using Average Jaccard Index (also known in literature as Intersection-over-Union), and the class-wise scores were averaged over all ten classes with equal weights.

Overall, the problem looks like a standard image segmentation problem with some multispectral input specificities.

About the data

One of the issues that participants needed to overcome during competition is a lack of training data. We were provided 25 pictures, covering 25 square kilometers. This may sound like a lot, but these images are pretty diverse: jungles, villages and farmland. And they are really different. This made our life harder.

Clouds are a major challenge for satellite imaging. In the provided data, however, clouds were presented rather as an exception than a rule. So the issue did not affect us as much.

The fact that we did not have access to the images of the same area at different time was a big complication. We believe that temporal information could significantly improve our model performance. For instance pixel wise difference between images of the same area corresponds to an object that can change its position with time making it possible to identify moving cars.

Another interesting thing is that satellite takes a shot in such a way that in the M-band, channels 2, 3, 5 and 7 come a few seconds later than 1, 4, 6 and 8 which leads to a ghosts of a moving objects:

Speaking of cars, they were often labeled unreliably. The image below shows the reference annotation of cars. Needless to say, that a fraction of this reference data corresponds to unrelated objects: debris, random road pixels, etc:

As a result of all these complications we gave up on the both vehicle classes and did not predict them in our final solution. There were a few approaches that we wanted to try, like constrain the phase space to villages and roads and run FasterRNN or SSD to localize the cars, but we did not have time to implement it.

Now when we are done with an introduction we would like to present a set of ideas that helped us to finish at the third place.

First idea: Class distributions

A stable local validation is 90% of the success. In every ML textbook one can find the following words: “Assuming that train and test set are generated from the same iid…”, followed by a discussion on how to perform cross-validation. In practice, both in industry and in competitions, this assumption is satisfied only approximately, and it is very important to know how accurate this approximation is.

Class distribution in the training set can be easily obtained from the annotation data / labels. To find out how much area each class occupies in the test set we used the following trick:

For each class we made a dedicated submission in which all pixels were attributed to that class only. A jaccard score returned by the system for such submission gave us the relative area of the class in the public fraction of the test set. This approach worked because of the specific submission and evaluation formats of this particular competition (recall that the final score is an average of the classes scores, so a submission with just one class presented would give a score for this class). In addition, when the competition was over, we used the same trick to obtain distributions of the classes in images of the private fraction of the test set.

First of all, this plot tells us, that classes are heavily imbalanced within each set of images. For example, one pixel of both large and small vehicle classes corresponds to a 60,000 pixels with crops. This lead us to a conclusion that training a separate model per class will work much better than a one model that predicts all classes at a time. Of course we tried to train such a model and it works surprisingly well. We we able to get in a top 10% but not more.

Secondly, the distributions themselves vary significantly from one set of images to another.

For instance, an initial version of the train set did not contain any images with waterways, and only after competitors pointed that out, the organizers moved two images of rivers from the test into the train set. One can see that this particular class was still under-represented in the train set compared to the public and the private test sets. As a result simple unsupervised methods for water classes worked better than neural network approaches.

It is also important to mention that when train and test sets are relatively different those that participated in the competition after results on the Private Leaderboard are released change their standing a lot (in this problem participants in the middle of the Leaderboard easily moved +- 100 places). And those that did not, as usual, claimed that participants are overfitters that do not understand machine learning and do not know how to perform cross validation properly.

Second idea: water classes

These days for the most of the computer vision problems neural networks are the most promising approach. What other methods can one use for the satellite image segmentation? Deep Learning became popular only in the last few years, and people have worked with satellite images for a much longer time. We discussed this problem with former and current employees of Orbital Insight and Descartes Labs plus we read a ton of literature on the subject. Apparently fact that we have infrared and other channels from non-optical frequency range allows to identify some classes purely from the pixel values, without any contextual information. Using this approach, the best results were obtained for water and vegetation classes.

For instance, in our final solution both water classes were segmented using NDWI, which is just a ratio of the difference and sum of the pixel values in the green and infrared channels.

This image demonstrates high intensity values for waterways, but it also shows false positives on some buildings – perhaps due to the relative similarity of the specific heat of metal roofs and water.

We expected a deep learning approach to perform as well as or even better than index thresholding and, in vegetation prediction, neural networks did indeed outperform indices. However, we found that indices allow us to achieve better results for under-represented classes such as waterways and standing water. In the provided images, ponds were smaller than rivers, so we additionally thresholded our predictions by area of water body to distinguish waterways from standing water.

Third idea: Neural Networks

As we mentioned before classes are very different and given the lack of data beating top 10% score with a single model would be very tough. Instead, we decided to train a separate network for each class, except water and cars.

We tried a lot of various network architectures and settled on a modified U-net that previously had shown very good results in the problem of Ultrasound Nerve Segmentation. We had a lot of hope for Tiramisu, but its convergence was slow and performance wasn’t satisfactory enough.

We used Nadam Optimizer (Adam with Nesterov momentum) and trained the network for 50 epochs with a learning rate of 1e-3 and additional 50 epochs with a learning rate of 1e-4. Each epoch was trained on 400 batches, each batch containing 128 image patches. Each batch was created randomly cropping 112x112 patches from original images. In addition each patch was modified by applying a random transformation from group D4.

Initially we tried 224x224 but due to limited GPU memory this would significantly reduce the batch size from 128 to 32. Larger batches proved to be more important than a larger receptive field. We believe that was due to the train set containing 25 images only, which differ from one another quite heavily. As a result, we decided to trade-off receptive field size in favour of a larger batch size.

Fourth idea: what do we feed to networks

If we just had RGB data and nothing else the problem would be much simpler. We would just crop big images into batches and feed them into a network. But we have RGB, P, M and A bands which have different color depth, resolution and could be shifted in time and space. All of them may contain unique and useful information that needs to be used.

After tackling this problem in a bunch of different ways we ended up dropping A-band, M was stretched from 800x800 to 3600x3600 and stacked it with RGB and P images.

Deep neural networks can find the interactions between different features when the amount of the data is sufficient, but we suspected that for this problem it was not the case. In the recent Allstate competition, quadratic features improved performance of xgboost, which is very efficient at finding feature interactions as well. We followed a similar path and added the indices CCCI, EVI, SAVI, NDWI as four extra channels. We like to think of these extra channels as an added domain knowledge.

We suspect that these indices would not enhance performance on larger dataset.

Fifth idea: loss function

As already mentioned, the evaluation metric for this competition was an Average Jaccard Index. A common loss function for classification tasks is categorical cross entropy but in our case classes are not mutually exclusive and using binary cross entropy makes more sense. But we can go deeper. It is well known that in order to get better results your evaluation metric and your loss function need to be as similar as possible. The problem here however is that Jaccard Index is not differentiable. One can generalize it for probability prediction, which on one hand, in the limit of the very confident predictions, turns into normal Jaccard and on the other hand is differentiable – allowing the usage of it in the algorithms that are optimized with gradient descent.

That logic led us to the following loss function:

Sixth idea: local boundary effects

It is pretty clear how to prepare the patches for training - we just crop them from the original images, augment and feed into the network with the structure that we described above.

What about predictions? In the zeroth order approximation everything looks straightforward, partition (3600, 3600) image into small patches, make prediction and stitch them together.

We did this and got something like this:

Why do we have this weird square structure? And the answer is that not all outputs in the Fully Connected Networks are equally good. Number of ways that you can get from any input pixel to the central part of the output in a network is much higher than to the edge ones. As a result prediction quality is decreasing when you move away from center. We checked this hypothesis for a few classes, and, for example, for buildings logloss vs distance from center looks like this:

One way to deal with such issue was to make the predictions on overlapping patches, and crop them on the edges, but we came out with a better way. We added Cropping2D layer to the output layers of our networks, which solved two problem simultaneously:

Losses on boundary artefacts were not back-propagated through the network;
Edges of the predictions were cropped automatically.

As a bonus, the trick slightly decreased the computation time.

To summarize, we trained a separate model for each of the first six classes. Each of them took matrix with a shape (128, 16, 112, 112) as an input and returned the mask for a central region of the input images (128, 1, 80, 80) as an output.

Seventh idea: global boundary effect

Are we done with the boundary effects? Not yet. To partition original images into (112, 112) tiles we added zeros to the edges of the original (3600, 3600) images, classical ZeroPadding. This added some problems at the prediction time. For example, sharp change in pixel values from central to zero padded area was probably interpreted by a network as a wall of the building and as a result we got a layer of building all over the perimeter in the predicted mask.

This issue was addressed with the same trick as in the original U-net paper. We added reflections of the central part to the padded areas.

We also considered modifying zero padding layers within a network to pad with reflections instead of zeros in a similar manner. Theoretically, this micro optimization may improve overall network performance, but we did not check it.

Eighth idea: test time augmentation

At this stage we already had a strong solution that allowed to get in top 10. But you can always go deeper. If it is possible to augment your train set to increase its size, you can always perform test time augmentation to decrease variance of the predictions.

We did it in the following way:
1. Original (3600, 3600) image is rotated by 90 degrees and we get 2 images: original and rotated.
2. Both are padded.
3. Split into tiles.
4. We perform predictions on each tile.
5. Combine predictions back into the original size.
6. Crop padding areas.
7. Prediction of the rotated image is rotated back to the original orientation.
8. Results of the both prediction pipelines averaged with geometric mean.

A schematic representation of the prediction pipeline:

How does it improve our predictions?
1. We decrease variance of the predictions.
2. Images are split in tiles in a different way and this helps to decrease local boundary effects.

We used 4 image orientations for test time augmentation, but one can do it with all eight elements of the D4 group. In general, any augmentation that can be used at the train time can be also used at the test time.

After all of these steps we managed to get something which looks surprisingly nice:

We finished 8th on the Public Leaderboard and 3rd on the Private.

Many thanks to Artem Yankov, Egor Panfilov, Alexey Romanov, Sarah Bieszad and Galina Malovichko for help in preparation of this article.

More from this team

Vladimir gave a talk based on his team's winning approach in this competition at the SF Kagglers meetup in April. Check out the video below!

Two Sigma Financial Modeling Kaggle Code Competition Winners' Interview

Kaggle's inaugural code competition, the Two Sigma Financial Modeling Challenge ran from December 2016 to March 2017. Over 2,000 players competed to search for signal in unpredictable financial markets data. As the very first code competition, competitors experimented with the data, trained models, and made submissions directly via Kernels, Kaggle's in-browser code execution platform.

In this winners' interview, team Bestfitting describes how they managed to remain a top-5 team even after a wicked leaderboard shake-up by focusing on building stable models and working effectively as a team. Read on to learn how they accounted for volatile periods of the market and experimented with reinforcement learning approaches.

The basics

What was your background prior to entering this challenge?

Bestfitting: I’ve worked as software developer for more than 15 years and as a machine learning & deep learning researcher for 5 years.

Bestfitting on Kaggle.

Zero: I worked as a bank data analyst for more than 4 years. I am enthusiastic about machine leaning.

Zero on Kaggle.

CircleCircle: I am a data analyst working on risk-control related solutions for banks.

CircleCircle on Kaggle.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

We learned a lot of skills from previous Kaggle competitions, such as feature analysis, feature selection, validation set build and how to control over-fitting.

According to our experiences in bank data analysis, we prefer to keep models stable in all kinds of market situations. We don’t pursue excessive public scores; a profitable yet stable model is the best.

How did you get started competing on Kaggle?

Bestfitting: I need all kinds of datasets and challenges to validate algorithms I’ve learned.

Zero: When I learned CS229 by Andrew Ng, he advised us to enter a competition.

CircleCircle: Kaggle is a great platform, I learned a lot from forums, and then, I decided to enter a competition and have a try.

What made you decide to enter this competition?

We entered this competition for two reasons:

First, as we know, Two Sigma is a very successful and creative company and we guessed the competition they hosted should be very interesting

Second, predicting financial market is very hard, we want to see how well we can do by using machine learning skills.

Let’s get technical

Summary

We are very happy with the result: we are the only team that stayed in the top 5 on both public and private leaderboards. We feel we are very lucky, but we can not win a competition only by luck; we did a lot of work to ensure profitable yet stable models.

We tried to build effective methods to evaluate our models and control risks, although we don’t have much financial background, we wanted to learn some ideas about quantitative investment.

Features

We used 4 kinds of features:

Basic features. Original features from the dataset which two-sigma provided.
Calculated features and lag features. Get by using simple functions on basic features, such as abs, log, standard deviation, and so on. We also used features of last few timestamps which are called lag-N features.
Predicted features. Predicted from first level weak model. They were used in second level model.
Whole-market features. We tried to build some features to get information from whole market: increasing or decreasing, calm or volatile. They were also used in our self-adaptive strategy.

Validation

We wanted to introduce some validation methods we used through the whole competition.

The first one is the Cumulative-R: we plotted the cumsum of R to find the performance as time goes on.

And we defined another simple reward value, we called it y-sign-R. For each sample, if the predicted y has same sign of real y, then the reward is 1, otherwise, -1. We summed the reward value up, and plot cumsum curve, we can see the curve on right side. If the cumsum of y-sign-R is less than zero, we think it is not a good model because they are not better than random guess. We can see that both the ET and LR model performed not so well, especially the ET model.

Models

We developed our models independently before we teamed up. Bestfitting and Zero’s model can get top-10 on private LB individually. We did not use CircleCircle’s model in final ensemble model due to run time limitation of the competition, but we used some features from her model.

Bestfitting’s model

Bestfittings model

We realized our models cannot identify the market environment, as we know, asset prices wave along with the whole market. So we plotted the y-mean of each timestamp and found that there were two volatile periods. We needed to add this information to the models.

Two volatile periods

So, we added mean of t_20 and t_30 of each timestamp and used it in ET model. The public score improved a lot and private score had a big improvement.

Bestfitting’s post-processing

BestFitting plotted the cumsum of real-y, y predicted by a ridge model, by ET model and an ensemble model. He found that the ET model had better performance in volatile periods especially while the market was increasing. And the ridge model has better performance in relatively calm/smooth periods.

And he also found that the ridge model can predict y-sign much better, but the value it predicted was small, if the market is in volatile periods the R reward will be small although the sign is correct.

So he tried to make his model more adaptive and can select correct model in different periods.

At this stage, we have teamed up. We tried a lot of methods, including reinforcement learning, but we couldn’t find a very stable reward, and time is limited, so we chose a rule based way.

We must let our model know whether the market is increasing or not, calm or volatile, so we defined some measurement, for example, we counted the sign of y_mean of last 5 timestamps, and used them as an indicator to ensemble Ridge and ET model dynamically. After these efforts his model went up to top 7 on public leader board and the private score dropped a little but it’s healthy he thought.

Journey of Bestfitting’s improvement

Let us have a look at Bestfitting’s journey of improvement in one chart. We think the performance was getting better in a stable way. If the competition ended at any time, the model will not be over-fitting.

Bestfitting's performance over time

Zero’s model
Before we teamed up, Zero’s model had a decent public score, but the private score is not so good. After we teamed up, Zero added whole market features to his models and his public scores improved a little and his private scores had a huge improvement.

Zero's model

Zero’s post-processing
After we teamed up, the most important thing we did is to make our model adaptive. Zero used a different strategy: he wanted to let model know whether the market is in volatile periods by standard deviation. He evaluated the standard deviation of y_mean of last 5 timestamps and compared it with standard deviation of y-mean of the training set. He then ensembled the models dynamically.

Zero's dynamic ensembling

Journey of Zero’s improvement
Now, let us have a look at the journey of zero’s improvement. As we can see, his model had a huge improvement in private score after we used whole-market features.

Zero's improvements over time

Ensemble
OK, let us go on with the ensemble of the two models.

We used weighted average of the two models, and we found that Zero’s model had worse performance when the market is calm, so we gave small weights to Zero’s model.

Ensemble of Bestfitting and Zero's models

Teamwork

How did your team form?

We built independent models before we teamed up, and we all had decent public score, but we were exceeded by other competitors as the competition went on. So we all realized that we needed to team up for stable models and good position.

How did your team work together?

After teaming up, Bestfitting was in charge of the structure of the stable model and the concept or similar concept of reinforcement learning so that the model was self-adaptive and stable.

Zero integrated the source codes of the models and he improved the performance making it possible to finish in 1 hour. CircleCircle tried to build a reinforcement learning model and built better validation sets. She also evaluated the model under different market environments.

Words of wisdom

What have you taken away from this competition?

We would rather believe in local multiple verification methods compared to public scores, such as R value of accumulated timestamps, the comparison between predicted Y accumulated and true Y accumulated value, and accuracy of Y’s positive or negative. So we could ensure the stability of model to get a certain income, but also to maintain a low risk.

We used whole market features to let our model have information about the market which let our models be stable.

We also tried some strategies to make our models more adaptive to different periods.

Do you have any advice for those just getting started in data science?

Get knowledge from good courses such as Stanford CS229 and CS231n.

Get information from competitions on Kaggle, kernels, and starter scripts.

Enter Kaggle competitions and get feedback from them.

Thanks

This is our story of hard and happy journey. We must give great thanks to Kaggle and Two Sigma. We think the code competition is more fair and we must pay more attention to the speed which led to more useful models.

Kaggle Data Science Bowl Competition Write Up Team Deep Breath

Team Deep Breath's solution write-up was originally published here by Elias Vansteenkiste and cross-posted on No Free Hunch with his permission.

The Data Science Bowl is an annual data science competition hosted by Kaggle. In this year’s edition the goal was to detect lung cancer based on CT scans of the chest from people diagnosed with cancer within a year.

To tackle this challenge, we formed a mixed team of machine learning savvy people of which none had specific knowledge about medical image analysis or cancer prediction. Hence, the competition was both a noble challenge and a good learning experience for us.

The competition just finished and our team Deep Breath finished 9th! In this post, we explain our approach.

The Deep Breath team consists of Andreas Verleysen, Elias Vansteenkiste, Fréderic Godin, Ira Korshunova, Jonas Degrave, Lionel Pigou and Matthias Freiberger. We are all PhD students and postdocs at Ghent University.

The 10 Most Common Causes of Cancer Death (Credit: Cancer Research UK)

Introduction

Lung cancer is the most common cause of cancer death worldwide. Second to breast cancer, it is also the most common form of cancer. To prevent lung cancer deaths, high risk individuals are being screened with low-dose CT scans, because early detection doubles the survival rate of lung cancer patients. Automatically identifying cancerous lesions in CT scans will save radiologists a lot of time. It will make diagnosing more affordable and hence will save many more lives.

To predict lung cancer starting from a CT scan of the chest, the overall strategy was to reduce the high dimensional CT scan to a few regions of interest. Starting from these regions of interest we tried to predict lung cancer. In what follows we will explain how we trained several networks to extract the region of interests and to make a final prediction starting from the regions of interest.

This post is pretty long, so here is a clickable overview of different sections if you want to skip ahead:

The Needle in The Haystack

To determine if someone will develop lung cancer, we have to look for early stages of malignant pulmonary nodules. Finding an early stage malignant nodule in the CT scan of a lung is like finding a needle in the haystack. To support this statement, let’s take a look at an example of a malignant nodule in the LIDC/IDRI data set from the LUng Node Analysis Grand Challenge. We used this dataset extensively in our approach, because it contains detailed annotations from radiologists.

Given the wordiness of the official name, it is commonly referred as the LUNA dataset, which we will use in what follows.

A close-up of a malignant nodule from the LUNA dataset (x-slice left, y-slice middle and z-slice right).

The radius of the average malicious nodule in the LUNA dataset is 4.8 mm and a typical CT scan captures a volume of 400mm x 400mm x 400mm. So we are looking for a feature that is almost a million times smaller than the input volume. Moreover, this feature determines the classification of the whole input volume. This makes analyzing CT scans an enormous burden for radiologists and a difficult task for conventional classification algorithms using convolutional networks.

This problem is even worse in our case because we have to try to predict lung cancer starting from a CT scan from a patient that will be diagnosed with lung cancer within one year of the date the scan was taken. TIn the LUNA dataset contains patients that are already diagnosed with lung cancer. In our case the patients may not yet have developed a malignant nodule. So it is reasonable to assume that training directly on the data and labels from the competition wouldn’t work, but we tried it anyway and observed that the network doesn’t learn more than the bias in the training data.

Nodule Detection

Nodule Segmentation

To reduce the amount of information in the scans, we first tried to detect pulmonary nodules.

We built a network for segmenting the nodules in the input scan. The LUNA dataset contains annotations for each nodule in a patient. These annotations contain the location and diameter of the nodule. We used this information to train our segmentation network.

The chest scans are produced by a variety of CT scanners, this causes a difference in spacing between voxels of the original scan. We rescaled and interpolated all CT scans so that each voxel represents a 1x1x1 mm cube. To train the segmentation network, 64x64x64 patches are cut out of the CT scan and fed to the input of the segmentation network. For each patch, the ground truth is a 32x32x32 mm binary mask. Each voxel in the binary mask indicates if the voxel is inside the nodule. The masks are constructed by using the diameters in the nodule annotations.

intersection = sum(y_true * y_pred)
dice = (2. * intersection) / (sum(y_true) + sum(y_pred))

As objective function we choose to optimize the Dice coefficient. The dice coefficient is a commonly used metric for image segmentation. It behaves well for the imbalance that occurs when training on smaller nodules, which are important for early stage cancer detection. A small nodule has a high imbalance in the ground truth mask between the number of voxels in- and outside the nodule.

The downside of using the Dice coefficient is that it defaults to zero if there is no nodule inside the ground truth mask. There must be a nodule in each patch that we feed to the network. To introduce extra variation, we apply translation and rotation augmentation. The translation and rotation parameters are chosen so that a part of the nodule stays inside the 32x32x32 cube around the center of the 64x64x64 input patch.

The network architecture is shown in the following schematic. The architecture is largely based on the U-net architecture, which is a common architecture for 2D image segmentation. We adopted the concepts and applied them to 3D input tensors. Our architecture mainly consists of convolutional layers with 3x3x3 filter kernels without padding. Our architecture only has one max pooling layer, we tried more max pooling layers, but that didn’t help, maybe because the resolutions are smaller than in case of the U-net architecture. The input shape of our segmentation network is 64x64x64. For the U-net architecture the input tensors have a 572x572 shape.

A schematic of the segmentation network architecture. The tensor shapes are indicated inside the dark grey boxes and network operations inside the light grey. A C1 is a convolutional layer with 1x1x1 filter kernels and C3 is a convolutional layer with 3x3x3 filter kernels

The trained network is used to segment all the CT scans of the patients in the LUNA and DSB dataset. 64x64x64 patches are taken out the volume with a stride of 32x32x32 and the prediction maps are stitched together. In the resulting tensor, each value represents the predicted probability that the voxel is located inside a nodule.

Blob Detection

In this stage we have a prediction for each voxel inside the lung scan, but we want to find the centers of the nodules. The nodule centers are found by looking for blobs of high probability voxels. Once the blobs are found their center will be used as the center of nodule candidate.

In our approach blobs are detected using the Difference of Gaussian (DoG) method, which uses a less computational intensive approximation of the Laplacian operator.
We used the implementation available in skimage package.

After the detection of the blobs, we end up with a list of nodule candidates with their centroids.
Unfortunately the list contains a large amount of nodule candidates. For the CT scans in the DSB train dataset, the average number of candidates is 153.
The number of candidates is reduced by two filter methods:

Applying lung segmentation before blob detection
Training a false positive reduction expert network

Lung Segmentation

Since the nodule segmentation network could not see a global context, it produced many false positives outside the lungs, which were picked up in the later stages. To alleviate this problem, we used a hand-engineered lung segmentation method.

At first, we used a similar strategy as proposed in the Kaggle Tutorial. It uses a number of morphological operations to segment the lungs. After visual inspection, we noticed that quality and computation time of the lung segmentations was too dependent on the size of the structuring elements. A second observation we made was that 2D segmentation only worked well on a regular slice of the lung. Whenever there were more than two cavities, it wasn’t clear anymore if that cavity was part of the lung.

An example of a z-slice where you can see multiple cavities with air. The main ones are inside the lungs and the other ones are future farts or burps happily residing in the intestines.

Our final approach was a 3D approach which focused on cutting out the non-lung cavities from the convex hull built around the lungs.

A z-slice of the CT scan in the middle of the the chest. On the left side the morphological approach, on the right side the convex hull approach.

False Positive Reduction

To further reduce the number of nodule candidates we trained an expert network to predict if the given candidate after blob detection is indeed a nodule. We used lists of false and positive nodule candidates to train our expert network. The LUNA grand challenge has a false positive reduction track which offers a list of false and true nodule candidates for each patient.

For training our false positive reduction expert we used 48x48x48 patches and applied full rotation augmentation and a little translation augmentation (±3 mm).

Architecture

If we want the network to detect both small nodules (diameter <= 3mm) and large nodules (diameter > 30 mm), the architecture should enable the network to train both features with a very narrow and a wide receptive field. The inception-resnet v2 architecture is very well suited for training features with different receptive fields. Our architecture is largely based on this architecture. We simplified the inception resnet v2 and applied its principles to tensors with 3 spatial dimensions. We distilled reusable flexible modules.

These basic blocks were used to experiment with the number of layers, parameters and the size of the spatial dimensions in our network.

A schematic of the spatial reduction block. The tensor shapes are indicated inside the dark grey boxes and network operations inside the light grey boxes

The first building block is the spatial reduction block. The spatial dimensions of the input tensor are halved by applying different reduction approaches. Max pooling on the one hand and strided convolutional layers on the other hand

A schematic of the feature reduction block

The feature reduction block is a simple block in which a convolutional layer with 1x1x1 filter kernels is used to reduce the number of features. The number of filter kernels is the half of the number of input feature maps.

A schematic of the residual convolutional block, with n the number of base filters

The residual convolutional block contains three different stacks of convolutional layers block, each with a different number of layers. The most shallow stack does not widen the receptive field because it only has one conv layer with 1x1x1 filters. The deepest stack however, widens the receptive field with 5x5x5. The feature maps of the different stacks are concatenated and reduced to match the number of input feature maps of the block. The reduced feature maps are added to the input maps. This allows the network to skip the residual block during training if it doesn’t deem it necessary to have more convolutional layers. Finally the ReLu nonlinearity is applied to the activations in the resulting tenor.

We experimented with these bulding blocks and found the following architecture to be the most performing for the false positive reduction task:

def build_model(l_in):
    l = conv3d(l_in, 64)

    l = spatial_red_block(l)
    l = res_conv_block(l)
    l = spatial_red_block(l)
    l = res_conv_block(l)
    l = spatial_red_block(l)
    l = res_conv_block(l)

    l = feat_red(l)
    l = res_conv_block(l)
    l = feat_red(l)

    l = dense(drop(l), 128)

    l_out = DenseLayer(l, num_units=1, nonlinearity=sigmoid)
    return l_out

An important difference with the original inception is that we only have one convolutional layer at the beginning of our network. In the original inception resnet v2 architecture there is a stem block to reduce the dimensions of the input image.

Results

Our validation subset of the LUNA dataset consists of the 118 patients that have 238 nodules in total. After segmentation and blob detection 229 of the 238 nodules are found, but we have around 17K false positives. To reduce the false positives the candidates are ranked following the prediction given by the false positive reduction network.

Top	True Positives	False Positives
10	221	959
4	187	285
2	147	89
1	99	19

Malignancy Prediction

It was only in the final 2 weeks of the competition that we discovered the existence of malignancy labels for the nodules in the LUNA dataset. These labels are part of the LIDC-IDRI dataset upon which LUNA is based. For the LIDC-IDRI, 4 radiologist scored nodules on a scale from 1 to 5 for different properties. The discussions on the Kaggle discussion board mainly focussed on the LUNA dataset but it was only when we trained a model to predict the malignancy of the individual nodules/patches that we were able to get close to the top scores on the LB.

def build_model(l_in):
    l = conv3d(l_in, 64)

    l = spatial_red_block(l)
    l = res_conv_block(l)
    l = spatial_red_block(l)
    l = res_conv_block(l)

    l = spatial_red_block(l)
    l = spatial_red_block(l)

    l = dense(drop(l), 512)

    l_out = DenseLayer(l, num_units=1, nonlinearity=sigmoid)
    return l_out

The network we used was very similar to the FPR network architecture. In short it has more spatial reduction blocks, more dense units in the penultimate layer and no feature reduction blocks.

We rescaled the malignancy labels so that they are represented between 0 and 1 to create a probability label. We constructed a training set by sampling an equal amount of candidate nodules that did not have a malignancy label in the LUNA dataset.

As objective function, we used the Mean Squared Error (MSE) loss which showed to work better than a binary cross-entropy objective function.

Lung Cancer Prediction

After we ranked the candidate nodules with the false positive reduction network and trained a malignancy prediction network, we are finally able to train a network for lung cancer prediction on the Kaggle dataset. Our strategy consisted of sending a set of n top ranked candidate nodules through the same subnetwork and combining the individual scores/predictions/activations in a final aggregation layer.

Transfer learning

After training a number of different architectures from scratch, we realized that we needed better ways of inferring good features. Although we reduced the full CT scan to a number of regions of interest, the number of patients is still low so the number of malignant nodules is still low. Therefore, we focussed on initializing the networks with pre-trained weights.

The transfer learning idea is quite popular in image classification tasks with RGB images where the majority of the transfer learning approaches use a network trained on the ImageNet dataset as the convolutional layers of their own network. Hence, good features are learned on a big dataset and are then reused (transferred) as part of another neural network/another classification task. However, for CT scans we did not have access to such a pretrained network so we needed to train one ourselves.

At first, we used the the fpr network which already gave some improvements. Subsequently, we trained a network to predict the size of the nodule because that was also part of the annotations in the LUNA dataset. In both cases, our main strategy was to reuse the convolutional layers but to randomly initialize the dense layers.

In the final weeks, we used the full malignancy network to start from and only added an aggregation layer on top of it. However, we retrained all layers anyway. Somehow logical, this was the best solution.

Aggregating Nodule Predictions

We tried several approaches to combine the malignancy predictions of the nodules. We highlight the 2 most successful aggregation strategies:

P_patient_cancer = 1 - ∏ P_nodule_benign: The idea behind this aggregation is that the probability of having cancer is equal to 1 if all the nodules are benign. If one nodule is classified as malignant, P_patient_cancer will be one.
The problem with this approach is that it doesn’t behave well when the malignancy prediction network is convinced one of the nodules is malignant. Once the network is correctly predicting that the network one of the nodules is malignant, the learning stops.
Log Mean Exponent: The idea behind this aggregation strategy is that the cancer probability is determined by the most malignant/the least benign nodule. The LME aggregation works as the soft version of a max operator. As the name suggest, it exponential blows up the predictions of the individual nodule predictions, hence focussing on the largest(s) probability(s). Compared to a simple max function, this function also allows backpropagating through the networks of the other predictions.

Ensembling

Our ensemble merges the predictions of our 30 last stage models. Since Kaggle allowed two submissions, we used two ensembling methods:

Defensive ensemble: Average the predictions using weights optimized on our internal validation set. The recurring theme we saw during this process was the high reduction of the number of models used in the ensemble. This is caused by the high similarity between the models. It turned out that for our final submission, only one model was selected.
Aggressive ensemble: Cross-validation is used to select the high-scoring models that will be blended uniformly. The models used in this ensemble are trained on all the data, hence the name ‘aggressive ensemble’. We uniformly blend these ‘good’ models to avoid the risk of ending up with an ensemble with very few models because of the high pruning factor during weight optimization. It also reduces the impact of an overfitted model.
Reoptimizing the ensemble per test patient by removing models that disagree strongly with the ensemble was not very effective because many models get pruned anyway during the optimization. Another approach to select final ensemble weights was to average the weights that were chosen during CV. This didn’t improve our performance. We also tried stacking the predictions using tree models but because of the lack of meta-features, it didn’t perform competitively and decreased the stability of the ensemble.

Final Thoughts

A big part of the challenge was to build the complete system. It consists of quite a number of steps and we did not have the time to completely fine tune every part of it. So there is still a lot of room for improvement. We would like to thank the competition organizers for a challenging task and the noble end.

The Deep Breath Team
Andreas Verleysen @resivium
Elias Vansteenkiste @SaileNav
Fréderic Godin @frederic_godin
Ira Korshunova @iskorna
Jonas Degrave @317070
Lionel Pigou @lpigou
Matthias Freiberger @mfreib

Kaggle's 2017 March Machine Learning Mania competition challenged Kagglers to do what millions of sports fans do every year–try to predict the winners and losers of the US men's college basketball tournament. In this winner’s interview, 1st place winner, Andrew Landgraf, describes how he cleverly analyzed his competition to optimize his luck.

What made you decide to enter this competition?

I am interested in sports analytics and have followed the previous competitions on Kaggle. Reading last year’s winner’s interview, I realized that luck is a major component of winning this competition, just like all brackets. I wanted to see if there was a way of maximizing my luck. For example, when entering an office pool, your strategy depends on whether you are facing 5 Duke alumns or the entire office. My goal was to systematically optimize my submissions against the competition.

This competition is unique among Kaggle contests in that there is a history of submissions from previous years. My idea was to model not only the probability of each team winning each game, but also the competitors’ submissions. Combining these models, I searched for the submission with the highest chance of finishing with a prize (top 5 on the leaderboard). A schematic of my approach is below. The three main processes are shaded in blue: (1) A model of the probability of winning each game, (2) a model of what the competitors are likely to submit, and (3) an optimization of my submission based on these two models.

A schematic of Andrew's approach

While I believe this approach is generally worthwhile, a much simpler approach would have also won the competition, as discussed at the end.

What was your approach? Did past March Mania competitions inform your winning strategy?

I kept my models simple and probabilistic. To model the outcomes of each game, I used a similar method as previous winners, One Shining MGF. I created my own team efficiency ratings using a regression model so that I could calculate the historical ratings before the tournament started. The ratings, and a distance from home metric (more on this later), were used as covariates in a Bayesian logistic regression model (using the rstanarm package) to predict the outcomes of each game.

To model competitors’ submissions, I built a mixed effects model (with lme4) using data from the previous competitions. I used the logit of the submitted probability as the response, the team efficiencies as fixed effects, random intercepts for competitors and games, and random efficiency slopes for competitors. I guessed that there would be 500 competitors and that 400 of them would make 2 submissions, which wasn’t too far off.

The plot below shows the models for the two Final Four semi-final games. The black lines are densities of 100 simulations from the mixed effects model and the orange line is the true distribution of competitors’ predictions. They line up well for the SC vs. Gonzaga game and a little less so for the Oregon vs. UNC game. The posterior distribution from my model is much tighter than distributions from the competitors. My two submissions are the two vertical lines.

Models for the two Final Four semi-final games

Finally, I used these models to come up with an optimal submission by simulating the bracket and the competitions’ submissions 10,000 times. This essentially gave me 10,000 simulated leaderboards of the competitors and my goal was to find the submission that most frequently showed up in the top 5 of the leaderboard. I tried to use a general-purpose optimizer, but it was very slow and it gave poor results. Instead, I sampled pairs of probabilities from the posterior many times, and chose the pair that was in the top 5 the most times. If I had naively used the posterior mean as a submission, my estimated probability of being in the top 5 would have been 15%, while my estimated probability of for the optimized submission (with two entries) went up to 25%.

The competitors’ submission model was trained on 2015 data. To assess the quality of the model, I have plotted the simulated distribution of the leaderboard losses for 2016 and 2017 and compared to the actual leaderboards. 2016 seems well in line, but 2017 had more submissions with lower losses than predicted. For both years, the actual 5th place loss was right in line with what was expected.

Simulated distribution of the leaderboard losses for 2016 and 2017 and compared to the actual leaderboards

Looking back, what would you do differently now?

A common strategy for this competition is to use the same predictions in both submissions except for the championship game, in which each team is given a 100% chance of winning in one of the submissions, guaranteeing that one of the two submissions will get the last game exactly correct. While I was aware of this strategy beforehand, I didn’t realize how good it is. If I had used this strategy, my estimated probability of being in the top 5 was 27%, 2 percentage points higher than my submission. This submission would have also won the competition.

What have you taken away from this competition?

Sometimes it’s better to be lucky than good. The location data that I used had a coding error in it. South Carolina’s Sweet Sixteen and Elite Eight games were coded as being in Greenville, SC instead of New York City. The led me to give them higher odds than most others, which helped me since they won. It is hard to say what the optimizer would have selected (and how it affected others’ models), but there is a good chance I would have finished in 2nd place or worse if the correct locations had been used.

Bio

Andrew Landgraf is a research statistician at Battelle. He received his Ph.D. in statistics from the Ohio State University, researching dimensionality reduction of binary and count data. At Battelle, he applies his statistical and machine learning expertise to problems in public health, cyber security, and transportation.