Can you see the random forest for its leaves? The Leaf Classification playground competition ran on Kaggle from August 2016 to February 2017. Kagglers were challenged to correctly identify 99 classes of leaves based on images and pre-extracted features. In this winner's interview, Kaggler Ivan Sosnovik shares his first place approach. He explains how he had better luck with logistic regression and random forests than with XGBoost or convolutional neural networks in this feature engineering competition.
Brief intro
I am an MSc student in Data Analysis at Skoltech, Moscow. I joined Kaggle about a year ago when I attended my first ML course at university. My first competition was What's Cooking. Since then, I've participated in several Kaggle competitions, but didn't pay much attention to them. It was more a bit of practice to understand how ML approaches work.
The idea of Leaf Classification was very simple yet challenging. It seemed I wouldn't have to stack many models and the solution could be elegant. Moreover, the total volume of data was just 100+ MB, and learning could be performed even on a laptop. That was very promising, because the majority of the computations were supposed to be done on my MacBook Air with a 1.3 GHz Intel Core i5 and 4 GB RAM.
I had worked with black-and-white images before. And there is a forest near my house. However, neither gave me much of an advantage in this competition.
Let’s get technical
When I joined the competition, several kernels with top 20% scores had already been published. Those solutions used the initially extracted features and logistic regression. That gave a decent baseline score, and no significant improvement could be achieved by tuning the parameters. To enhance the quality, feature engineering had to be performed. It seemed no one had done it yet, because the top solution's score was only slightly better than mine.
Feature engineering
I did first things first and plotted the images for each of the classes.
The raw images had different resolutions, rotations, aspect ratios, widths, and heights. However, the variation of each of these parameters within a class is smaller than the variation between classes. Therefore, some informative features could be constructed just on the fly. They are:
- width and height
- aspect ratio: `width / height`
- square: `width * height`
- is orientation horizontal: `int(width > height)`
Another very useful feature is the average value of the pixels of the image.
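In code, these on-the-fly features boil down to something like the following minimal sketch (the helper name and the use of PIL are illustrative assumptions, not the actual code):

```python
import numpy as np
from PIL import Image

def simple_features(path):
    """Illustrative helper: basic geometric features of one leaf image."""
    img = np.array(Image.open(path).convert("L"))   # greyscale pixel array
    height, width = img.shape
    return {
        "width": width,
        "height": height,
        "aspect_ratio": width / height,
        "square": width * height,
        "is_horizontal": int(width > height),
        "mean_pixel": img.mean(),                    # average pixel value
    }
```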
I added these features to the already extracted ones. Logistic regression enhanced the result. However, most of the work was yet to be done.
All of the features described above say nothing about the actual content of the image.
PCA
Despite the success of neural networks as feature extractors, I still like PCA. It is simple and yields a useful low-dimensional representation of an image. First of all, the images were rescaled to a common fixed size. Then PCA was applied, and the components were added to the set of previously extracted features.
The number of components was varied; in the end I settled on the number of principal components that worked best. This approach improved the score further.
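A minimal sketch of this step with scikit-learn (the image size and number of components below are illustrative placeholders, not the values actually used; `image_paths` is assumed to hold the paths to the raw images):

```python
import numpy as np
from PIL import Image
from sklearn.decomposition import PCA

SIZE = (50, 50)          # illustrative; the actual size was chosen by experiment
N_COMPONENTS = 35        # illustrative; the actual number was tuned as well

# Rescale every image to a common size and flatten it into one row.
flat_images = np.stack([
    np.asarray(Image.open(p).convert("L").resize(SIZE), dtype=np.float32).ravel()
    for p in image_paths
])

pca = PCA(n_components=N_COMPONENTS)
pca_features = pca.fit_transform(flat_images)    # one row of components per image
# pca_features is then concatenated with the hand-crafted features above
```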
Moments and hull
In order to generate even more features, I used OpenCV. There is a great tutorial on how to get the moments and hull of the image. I also added pairwise products of several features.
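A sketch of how such shape features can be extracted (the specific choice of Hu moments and the solidity feature are illustrative; the OpenCV 4.x `findContours` signature is assumed):

```python
import cv2
import numpy as np

def shape_features(img):
    """Illustrative sketch: moment and convex-hull features for a binary leaf
    image (img is a uint8 array with the leaf in white)."""
    moments = cv2.moments(img, binaryImage=True)
    hu = cv2.HuMoments(moments).ravel()              # 7 rotation-invariant Hu moments

    contours, _ = cv2.findContours(img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)     # the largest contour is the leaf
    hull = cv2.convexHull(contour)
    solidity = cv2.contourArea(contour) / cv2.contourArea(hull)  # leaf area / hull area

    return np.concatenate([hu, [solidity]])
```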
The final set of features is the following:
- Initial features
- height, width, ratio etc.
- PCA
- Moments
Logistic regression on this feature set demonstrated a further improvement of the score.
The main idea
All of the above demonstrated a good result. Such a result would be appropriate for a real-life application. However, it could be improved.
Uncertainty
The majority of objects had a certain decision: there was a single class with a probability close to 1, and the rest had probabilities close to 0. However, I found several objects with uncertainty in the prediction, like this:
The set of confusion classes was small (15 classes divided into several subgroups), so I decided to look at the pictures of the leaves and check whether I could classify them myself. Here is the result:
I must admit that Quercus (Oak) leaves look almost the same for different subspecies. I assume that I could distinguish Eucalyptus from Cornus, but the classification of subspecies seems complicated to me.
Can you really see the random forest for the leaves?
The key idea of my solution was to create another classifier that makes predictions only for the confusion classes. The first one I tried was RandomForestClassifier from sklearn, and it gave an excellent result after hyperparameter tuning. The random forest was trained on the same data as the logistic regression, but only the objects from the confusion classes were used.
If logistic regression gave uncertain predictions for an object, then the prediction of the random forest classifier was used instead. The random forest gave probabilities for the 15 confusion classes; the rest were assumed to be absolute zeros.
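A rough sketch of this two-level scheme (the function name, the hyperparameters, and the 0.9 uncertainty rule below are illustrative assumptions rather than the exact code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

def two_level_predict(X_train, y_train, X_test, confusion_classes, threshold=0.9):
    """Logistic regression for everything, a random forest only for the
    confusion classes."""
    logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    proba = logreg.predict_proba(X_test)

    # Second-level model trained only on objects from the confusion classes.
    mask = np.isin(y_train, confusion_classes)
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(X_train[mask], y_train[mask])

    # Columns of the logistic-regression output that correspond to rf.classes_.
    conf_idx = [list(logreg.classes_).index(c) for c in rf.classes_]

    final = proba.copy()
    uncertain = proba.max(axis=1) < threshold        # "uncertain prediction" rule
    if uncertain.any():
        rf_proba = rf.predict_proba(X_test[uncertain])
        # Overwrite only the confusion-class probabilities; the rest stay near zero.
        final[np.ix_(np.where(uncertain)[0], conf_idx)] = rf_proba
    return final
```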
The final pipeline is the following:
Threshold
The leaderboard score was calculated on the whole dataset. That is why some risky approaches could be used in this competition.
Submissions are evaluated using the multi-class logloss.
$$\mathrm{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij}),$$

where $N$ and $M$ are the number of objects and classes respectively, $p_{ij}$ is the predicted probability that object $i$ belongs to class $j$, and $y_{ij}$ is the indicator: it equals $1$ if object $i$ is in class $j$ and $0$ otherwise. If the model correctly chose the class, then the following approach will decrease the overall logloss; otherwise, it will increase dramatically.
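The risky step itself is tiny; a minimal sketch (the clipping constant mirrors the usual logloss implementation and is an assumption):

```python
import numpy as np

def threshold_predictions(proba, eps=1e-15):
    """Push the most probable class to 1 and all other classes to 0.
    If every argmax is correct, the logloss collapses to (almost) zero;
    a single mistake blows the score up."""
    hard = np.zeros_like(proba)
    hard[np.arange(len(proba)), proba.argmax(axis=1)] = 1.0
    return np.clip(hard, eps, 1 - eps)   # keep values strictly inside (0, 1)
```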
After thresholding, the logloss dropped to zero. That's it. All the labels were correct.
What else?
I tried several methods that showed decent results but were not used in the final pipeline. Moreover, I had some ideas on how to make the solution more elegant. In this section, I'll try to discuss them.
XGBoost
XGBoost by dmlc is a great tool. I've used it in several competitions before and decided to train it on the initially extracted features. It demonstrated the same score as logistic regression, or even worse, but the time consumption was way bigger.
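Roughly, the comparison looked like this (hyperparameters are illustrative; `X_train` and `y_train` stand for the pre-extracted features and labels):

```python
import xgboost as xgb

# A multiclass XGBoost model on the pre-extracted features.
model = xgb.XGBClassifier(
    objective="multi:softprob",   # predict class probabilities for the logloss metric
    n_estimators=500,
    max_depth=4,
    learning_rate=0.1,
)
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)
```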
Submission blending
Before I came up with the idea of using a random forest as the second classifier, I tried different one-model methods, so I collected lots of submissions. The trivial idea is to blend the submissions: use the mean of the predictions, or a weighted mean. The result did not impress me either.
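Blending saved submission files amounts to a few lines of pandas (the file names are hypothetical):

```python
import pandas as pd

# Average several saved submissions column-wise.
files = ["submission_logreg.csv", "submission_pca.csv", "submission_moments.csv"]
subs = [pd.read_csv(f, index_col="id") for f in files]

blend = sum(subs) / len(subs)      # plain mean; a weighted mean works the same way
blend.to_csv("submission_blend.csv")
```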
Neural networks
Neural networks were one of the first ideas I tried to implement. Convolutional neural networks are good feature extractors, so they could be used as a first-level model or even as the main classifier. The original images came in different resolutions, and I rescaled them to a common size. Training a CNN on my laptop was too time-consuming to choose the right architecture in a reasonable time, so I abandoned this idea after several hours of training. I believe that CNNs could give accurate predictions for this dataset.
Bio
I am Ivan Sosnovik. I am a second-year master's student at Skoltech and MIPT. Deep learning and applied mathematics are of great interest to me. You can visit my GitHub to check out some stunning projects.