
Planet: Understanding the Amazon from Space, 1st Place Winner's Interview

In our recent Planet: Understanding the Amazon from Space competition, Planet challenged the Kaggle community to label satellite images from the Amazon basin, in order to better track and understand causes of deforestation.

The competition contained over 40,000 training images, each of which could contain multiple labels, generally divided into the following groups:

  • Atmospheric conditions: clear, partly cloudy, cloudy, and haze
  • Common land cover and land use types: rainforest, agriculture, rivers, towns/cities, roads, cultivation, and bare ground
  • Rare land cover and land use types: slash and burn, selective logging, blooming, conventional mining, artisanal mining, and blow down

We recently talked to user bestfitting, the winner of the competition, to learn how he used an ensemble of 11 finely tuned convolutional nets, models of label correlation structure, and a strong focus on avoiding overfitting, to achieve 1st place.

Basics

What was your background prior to entering this challenge?

I majored in computer science and have more than 10 years of experience programming in Java and working on large-scale data processing, machine learning, and deep learning.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

I entered a few deep learning competitions on Kaggle this year. The experiences and the intuition I gained helped a lot.

How did you get started competing on Kaggle?

I’ve been reading a lot of books and papers about machine learning and deep learning since 2010, but I always found it hard to apply the algorithms I learned on the kinds of small datasets that are usually available. So I found Kaggle a great platform, with all the interesting datasets, kernels, and great discussions. I couldn’t wait to try something, and entered the “Predicting Red Hat Business Value” competition last year.

What made you decide to enter this competition?

I entered this competition for two reasons.

First, I’m interested in nature conservation. I think it’s cool to use my skills to make our planet and life better. So I’ve entered all the competitions of this kind that Kaggle has hosted this year. And I’m especially interested in the Amazon rainforest since it appears so often in films and stories.

Second, I’ve entered various deep learning competitions on Kaggle involving tasks like segmentation and detection, so I wanted a classification challenge to try something different.

Let's Get Technical

Can you introduce your solution briefly first?

This is a multi-label classification challenge, and the labels are imbalanced.

It was a hard competition: image classification algorithms have been widely studied and applied in recent years, and many experienced computer vision competitors took part.

I tried many kinds of popular classification algorithms that I thought might be helpful, and based on careful analysis of label relationships and model capabilities, I was able to build an ensemble method that won 1st place.

This was my model’s architecture:

In words:

  • First, I preprocessed the dataset (by resizing the images and removing haze), and applied several standard data augmentation techniques.
  • Next, for my models, I fine-tuned 11 convolutional neural networks (I used a variety of popular, high-performing CNNs like ResNets, DenseNets, Inception, and SimpleNet) to get a set of class label probabilities for each CNN.
  • I then passed each CNN’s class label probabilities through its own ridge regression model, in order to adjust the probabilities to take advantage of label correlations.
  • Finally, I ensembled all 11 CNNs, by using another ridge regression model.
  • Also of note is that instead of using a standard log loss as my loss function, I used a special soft F2-loss in order to get a better score on the F2 evaluation metric.
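Putting those steps together, the end-to-end prediction for a single image can be sketched schematically as follows (shapes only; the ridge stages are shown as plain linear maps, and every name and number here is illustrative rather than the exact code used). Each stage is described in more detail below.

    import numpy as np

    N_MODELS, N_LABELS = 11, 17

    def predict_one_image(cnn_probs, label_ridges, ensemble_ridges, thresholds):
        """cnn_probs: (N_MODELS, N_LABELS) raw per-CNN label probabilities."""
        # stage 1: recalibrate each CNN's 17 probabilities with its own ridge layer
        recal = np.stack([W @ p + b for (W, b), p in zip(label_ridges, cnn_probs)])
        # stage 2: for each label, combine the 11 recalibrated probabilities
        final = np.array([w @ recal[:, j] + b
                          for j, (w, b) in enumerate(ensemble_ridges)])
        # stage 3: per-label thresholds tuned for the F2 metric
        return final > thresholds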

What preprocessing and feature engineering did you do?

I used several preprocessing and data augmentation steps.

  • First, I resized images.
  • I also added data augmentation by flipping, rotating, transposing, and elastic transforming images in my training and test sets.
  • I also used a haze removal technique, described in this “Single Image Haze Removal using Dark Channel Prior” paper, to help my networks “see” the images more clearly.
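For reference, the core of the dark-channel-prior dehazing described in that paper can be sketched as follows. This is a simplified version that skips the soft-matting/guided-filter refinement, and the patch size and constants are common defaults, not necessarily the exact ones used here.

    import numpy as np
    from scipy.ndimage import minimum_filter

    def dark_channel(img, patch=15):
        # per-pixel minimum over RGB, then a minimum filter over a local patch
        return minimum_filter(img.min(axis=2), size=patch)

    def remove_haze(img, patch=15, omega=0.95, t0=0.1):
        """img: float RGB array in [0, 1]. Returns the dehazed image."""
        dark = dark_channel(img, patch)
        # atmospheric light A: brightest image pixels among the top 0.1%
        # of the dark channel
        n = max(int(dark.size * 0.001), 1)
        idx = np.unravel_index(np.argsort(dark, axis=None)[-n:], dark.shape)
        A = img[idx].max(axis=0)
        # transmission estimate, clipped so dense haze is not over-amplified
        t = 1.0 - omega * dark_channel(img / A, patch)
        t = np.clip(t, t0, 1.0)
        return np.clip((img - A) / t[..., None] + A, 0.0, 1.0)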

Here are some examples of haze removal on the dataset:

As we can see in the following chart, haze removal improved the F2 score of some labels (e.g., water and bare_ground), but decreased the F2 score of others (e.g., haze and clear). However, this was fine since ensembling can select the strongest models for each label, and the haze removal trick helped overall.

What supervised learning methods did you use?

The base of my ensemble consisted of 11 popular convolutional networks: a mixture of ResNets and DenseNets with different numbers of parameters and layers, as well as an Inception and a SimpleNet model. I fine-tuned all layers of these pre-trained CNNs after replacing the final output layer to match the competition's 17 labels, and I didn't freeze any layers.
The training set consisted of 40,000+ images, which would have been large enough to train some of these CNN architectures (e.g., resnet_34 and resnet_50) from scratch, but I found that fine-tuning the pre-trained weights performed a little better.
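In PyTorch, this kind of fine-tuning looks roughly like the following (ResNet-50 shown as an example; the optimizer and learning rate are placeholders, not the exact training settings used):

    import torch
    import torch.nn as nn
    from torchvision import models

    NUM_LABELS = 17

    # load an ImageNet-pretrained ResNet-50 and replace the final layer
    # with a 17-way output for the competition's multi-label targets
    model = models.resnet50(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, NUM_LABELS)

    # nothing is frozen: every parameter stays trainable during fine-tuning
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

    # multi-label prediction: an independent sigmoid per label
    def predict_probs(model, images):
        model.eval()
        with torch.no_grad():
            return torch.sigmoid(model(images))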

Did you use any special techniques to model the evaluation metric?

Submissions were evaluated on their F2 score, which combines precision and recall into a single number, like the F1 score but with recall weighted more heavily than precision. So we not only needed to train our models to predict label probabilities, but also had to select optimal thresholds to decide whether or not to assign each label given its probability.
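A simple way to pick such thresholds is a greedy per-label search on out-of-fold predictions, along these lines (the search grid and starting value are arbitrary choices for illustration):

    import numpy as np
    from sklearn.metrics import fbeta_score

    def tune_thresholds(probs, y, grid=np.arange(0.05, 0.95, 0.01)):
        """Greedy per-label threshold search that maximizes the mean F2 score.

        probs: (n_samples, 17) predicted probabilities; y: binary ground truth.
        """
        thresholds = np.full(y.shape[1], 0.2)
        for j in range(y.shape[1]):
            best_t, best_score = thresholds[j], -1.0
            for t in grid:
                trial = thresholds.copy()
                trial[j] = t
                score = fbeta_score(y, (probs > trial).astype(int),
                                    beta=2, average='samples')
                if score > best_score:
                    best_t, best_score = t, score
            thresholds[j] = best_t
        return thresholds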

At first, like many other competitors, I used log loss as my loss function. However, as the chart below shows, lower log losses don’t necessarily lead to higher F2 scores.

This meant finding another loss function that allows the model to pay more attention to each label's recall. So, with the help of code from the forums, I wrote my own soft F2-loss function.
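One common formulation of such a loss replaces the hard true/false positive counts with their "soft" (probability-weighted) versions, making the F2 score differentiable. A PyTorch sketch of this idea, not necessarily the exact version used:

    import torch

    def soft_f2_loss(logits, targets, beta=2.0, eps=1e-8):
        """Differentiable surrogate for the per-image F2 score.

        logits:  raw model outputs, shape (batch, 17)
        targets: binary label matrix of the same shape
        """
        probs = torch.sigmoid(logits)
        tp = (probs * targets).sum(dim=1)          # soft true positives per image
        fp = (probs * (1 - targets)).sum(dim=1)    # soft false positives
        fn = ((1 - probs) * targets).sum(dim=1)    # soft false negatives
        precision = tp / (tp + fp + eps)
        recall = tp / (tp + fn + eps)
        f2 = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall + eps)
        return 1.0 - f2.mean()                     # minimize 1 - mean F2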

This did indeed improve the overall F2 score, and in particular, the F2 score of labels like agriculture, cloudy, and cultivation.

What was your most important insight into the data and models?

I analyzed the correlation between labels, and found that certain labels coexist quite frequently, whereas others do not. For example, the clear, partly cloudy, cloudy, and haze labels are disjoint, but habitation and agriculture labels appear together quite frequently. This meant that making use of this correlation structure would likely improve my model.

For example, let’s take my resnet-101 model. This predicts probabilities for each of the 17 labels. In order to take advantage of label correlations, though, I added another ridge-regularized layer to recalibrate each label probability given all the others.

In other words, to predict the final clear probability (from the resnet-101 model alone), I have a specific clear ridge regression model that takes in the resnet-101 model’s predictions of all 17 labels.
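Concretely, this first-level recalibration can be sketched with scikit-learn as follows (the regularization strength is a placeholder, and in practice the models would be fit on out-of-fold predictions):

    import numpy as np
    from sklearn.linear_model import Ridge

    def fit_label_recalibrators(cnn_probs, y, alpha=1.0):
        """One ridge model per label; each sees all 17 probabilities from one CNN.

        cnn_probs: (n_samples, 17) out-of-fold probabilities from a single CNN
        y:         (n_samples, 17) binary ground-truth labels
        """
        models = []
        for j in range(y.shape[1]):
            reg = Ridge(alpha=alpha)
            reg.fit(cnn_probs, y[:, j])   # e.g. predict "clear" from all 17 probs
            models.append(reg)
        return models

    def recalibrate(models, cnn_probs):
        return np.column_stack([m.predict(cnn_probs) for m in models])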

How did you ensemble your models?

After we get predictions from all N models, we have N different probabilities for the clear label. We can use them to predict the final clear probability with another ridge regression.

This kind of two-level ridge regression does two things:

  1. First, it allows us to use the correlation information among the different labels.
  2. It allows us to select the strongest models to predict each label.
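A sketch of this second level, again with scikit-learn: one small ridge model per label, fit on the stacked, recalibrated predictions of all models (alpha is again just a placeholder).

    import numpy as np
    from sklearn.linear_model import Ridge

    def fit_ensemble_ridges(stacked, y, alpha=1.0):
        """Fit one ridge model per label on the stacked per-model predictions.

        stacked: (n_samples, n_models, 17) recalibrated per-model probabilities
        y:       (n_samples, 17) binary ground-truth labels
        """
        models = []
        for j in range(stacked.shape[2]):
            reg = Ridge(alpha=alpha)
            reg.fit(stacked[:, :, j], y[:, j])   # N model columns -> one label
            models.append(reg)
        return models

    def ensemble_predict(models, stacked):
        return np.column_stack([m.predict(stacked[:, :, j])
                                for j, m in enumerate(models)])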

Were you surprised by any of your findings?

Even though I’d predicted the final shakeup of the leaderboard (where the public and private leaderboard scores differed quite a bit), I was still surprised.

Essentially, at the last stage of the competition (10 days before the end), I found that the public scores were very close, and I couldn’t improve my local cross-validation or public scores any more. So I warned myself to be careful to avoid overfitting on what could just be label noise.

To understand this pitfall better, I simulated the division into public and private leaderboards by using different random seeds to select half of the training set images as new training sets. I found that as the seed changed, the difference between my simulated public and private scores could grow up to 0.0025. But the gap between the Top 1 and Top 10 entries on the public leaderboard was smaller than this value.

This meant that a big shakeup could very likely happen in the real competition as well.
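The simulation itself is easy to reproduce. A rough sketch, using a hypothetical f2_on(ids) helper that scores fixed out-of-fold predictions on a subset of images:

    import numpy as np

    def simulated_gap(image_ids, f2_on, n_trials=10, public_frac=0.5):
        """Estimate how far 'public' and 'private' halves of a split can diverge.

        f2_on(ids) -> F2 score of fixed out-of-fold predictions on those images.
        """
        gaps = []
        for seed in range(n_trials):
            rng = np.random.RandomState(seed)
            perm = rng.permutation(len(image_ids))
            cut = int(len(image_ids) * public_frac)
            public = [image_ids[i] for i in perm[:cut]]
            private = [image_ids[i] for i in perm[cut:]]
            gaps.append(abs(f2_on(public) - f2_on(private)))
        return max(gaps)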

After careful analysis, I found that this kind of variation arose from difficult images whose labels even humans would confuse, such as haze vs. cloudy, road vs. water, or blooming vs. selective logging.

Because of this, I persuaded myself that the public leaderboard scores weren’t a perfect metric of model capability. This was unexpected: since the public test set contains 40,000+ images, it seems like the leaderboard should be pretty stable.

So I adjusted my goal to simply keep myself in the top 10, and decided not to care about my exact position on the public leaderboard in the last week. Instead, I tried to find the most stable way to ensemble my models: I threw away any models likely to lead to overfitting, and in the end used voting and ridge regression.

Why so many models?

The answer is simple: diversity.

I don’t think the number of models is a big problem, for several reasons:

  1. First, if we want a simple model, we can choose just one or two of them, and they will still get a decent score on both the public and private leaderboards (top 20).
  2. Second, we have 17 labels, and different models have different capabilities on each label.
  3. Third, our solution will be used to replace or simplify the human labeling job. Since computational resources are relatively cheap compared to human labeling, we can label new images with strong models, correct any wrong predictions, and then use the expanded dataset to train stronger or simpler models iteratively.

What tools did you use?

Python 3.6, PyTorch, PyCharm community version.

What does your hardware setup look like?

A server with four NVIDIA GTX TITAN X Maxwell GPUs.

Words of wisdom

What have you taken away from this competition?

As we discussed above, I found that using a soft F2-loss function, adding a haze-removal algorithm, and applying two-level ridge regression were important in achieving good scores.

Also, due to label noise, we must trust our local cross-validation.

Do you have any advice for those just getting started in data science?

  1. Learn from good courses like Stanford’s CS229 and CS231n.
  2. Learn from Kaggle competitions, kernels, and starter scripts.
  3. Enter Kaggle competitions and use them to get feedback.
  4. Read papers every day and implement some of them.
