Does every painter leave a fingerprint? Accurately distinguishing the artwork of a master from a forgery can mean a difference in millions of dollars. In the Painter by Numbers playground competition hosted by Kiri Nichol (AKA small yellow duck), Kagglers were challenged to identify whether pairs of paintings were created by the same artist.
In this winner's interview, Nejc Ilenič takes us through his first place solution to this painter recognition challenge. His combination of unsupervised and supervised learning methods helped him achieve a final AUC of 0.9289. The greatest testament to his final model's performance? His model generally predicts greater similarity among authentic works of art by Johannes Vermeer compared to imitations by the fraudulent artist, Han van Meegeren.
The Basics
What was your background prior to entering this challenge?
I’m currently finishing my master’s degree in computer science at the University of Ljubljana. I began learning about data science five months before entering this competition by taking a data mining course offered by my faculty.
What made you decide to enter this competition?
At the beginning of the course I remember being thrilled by the fact that one can predict digits from images by writing only a few lines of Python code (i.e. implementing logistic regression). I soon realized that this is something I want to do in life, so competing on Kaggle seemed like a reasonable next step to put my newly acquired skill set to the test. I chose this particular competition mostly because I find the domain the data comes from intriguing.
Let’s get technical
How did you tackle the problem and what methods did you use?
First I will briefly describe the dataset and the preprocessing methods I used, and then I will explain how I built and validated a predictive model. The complete source code of the project along with the description of the approaches can be found in this GitHub repository.
The training set is unbalanced, and some classes are present only in the training set and some only in the test set. Additionally, input images are of various dimensions. There are 79433 instances and 1584 unique painters in the training set, and the test set is composed of 23817 instances. Predictions for approximately 22M pairs needed to be made for the submission.
The plot below shows the number of paintings for each of the 1584 painters in the training set.
Labeled images were split into training (0.9) and validation (0.1) sets in a stratified manner resulting in 71423 training examples and 8010 validation examples belonging to 1584 classes.
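As a rough illustration of what a stratified 0.9/0.1 split means, the sketch below splits each painter's paintings separately so that both sets keep roughly the same class proportions. It is not the code from the repository; the function name `stratified_split`, the seed and the toy labels are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.1, seed=0):
    """Return (train_indices, val_indices), splitting each class separately."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Rounding is done per class, so a painter with very few paintings
        # may end up with no validation example at all.
        n_val = round(len(indices) * val_fraction)
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return train, val


# Toy data: hypothetical painter "a" has 20 paintings, painter "b" has 10.
labels = ["a"] * 20 + ["b"] * 10
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

With an unbalanced dataset like this one, splitting per class avoids a validation set that is dominated by the most prolific painters.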
The model I’ve built assumes fixed-size inputs, so the first preprocessing step was to resize each image so that its smallest dimension is 256 pixels (retaining the aspect ratio) and then crop it at the center of the larger dimension, obtaining 256x256 images. Some information gets lost during this process. An alternative approach, where multiple crops are taken from the same image, was considered but not used for the final solution due to much longer training times (a bigger, but more correlated training set). Furthermore, mean values were subtracted from each feature in the data and the obtained values were normalized by dividing each dimension by its standard deviation. The statistics used for preprocessing were computed from a subset of the training instances. During the training phase random transformations (rotations, zooms, shifts, shears and flips) were applied to the data in order to reduce overfitting. This augmentation ensures that the model only rarely sees exactly the same example more than once.
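The resize-then-center-crop geometry described above can be sketched without any imaging library. The helper below only computes the resized dimensions and the crop box; `resize_and_crop_box` is a hypothetical name, and the rounding behaviour is an assumption rather than something stated in the interview.

```python
def resize_and_crop_box(width, height, target=256):
    """Return the resized (width, height) and the centered crop box.

    The box is (left, top, right, bottom) in resized-image coordinates.
    """
    scale = target / min(width, height)
    # The smaller side becomes exactly `target`; the larger side keeps the
    # aspect ratio, rounded to whole pixels.
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


# A hypothetical 1024x768 landscape image.
size, box = resize_and_crop_box(1024, 768)
print(size, box)
```

With an imaging library such as Pillow, the two results could presumably be passed to `Image.resize` and `Image.crop` in that order. Anything outside the box is discarded, which is the information loss mentioned above.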
There were two main approaches considered for verifying whether two instances belong to the same class. The unsupervised method involves training a model that can predict one of the 1584 classes and then taking the dot product of the two class distribution vectors (softmax outputs). The supervised method is an end-to-end metric learning approach called a siamese network. The main idea is to replicate the model once for each input image and merge their outputs into a single vector that can then be used to directly predict whether the two images were painted by the same artist. An important aspect of this architecture is that the weights of both models are shared, and during backpropagation the total gradient is the sum of the gradients contributed by the two models. Since the model trained for the unsupervised technique can also be used in the siamese architecture, most of the effort went into the multi-class painter recognition task.
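To make the first approach concrete, here is a minimal sketch of scoring a pair by the dot product of two softmax outputs. The logits and the four-painter setup are made up for illustration, and `same_painter_score` is a hypothetical name. If the two distributions are treated as independent, the dot product is the probability that both images are assigned to the same class.

```python
import math


def softmax(logits):
    """Numerically stable softmax of a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def same_painter_score(probs_a, probs_b):
    """Dot product of two class distributions, a value between 0 and 1."""
    return sum(a * b for a, b in zip(probs_a, probs_b))


# Made-up logits for three images over four hypothetical painters.
image_0 = softmax([4.0, 0.5, 0.1, 0.0])  # confident in painter 0
image_1 = softmax([3.5, 0.2, 0.3, 0.1])  # also leans towards painter 0
image_2 = softmax([0.0, 0.1, 0.2, 5.0])  # confident in painter 3

score_same = same_painter_score(image_0, image_1)
score_diff = same_painter_score(image_0, image_2)
print(score_same > score_diff)
```

Note that the score can only be high when both distributions put mass on a class the classifier already knows, which matters for painters missing from the training set.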
The depiction below illustrates the architecture of the final convolutional neural network, with non-linearities, dropouts and batch normalization layers omitted. Feature maps are produced by 3x3 convolutional filters with stride 1. On their own, such filters yield feature maps that are two neurons smaller along each of the two dimensions than their input volumes, so zero padding is used to retain the original shape. 2x2 max pooling with stride 2 then halves the number of neurons along each of the two dimensions. Non-linearities are applied to convolution and fully connected outputs using the PReLU function (Leaky ReLU with a trainable slope parameter in the negative part). The dense layers at the end of the architecture are the reason why fixed-size inputs need to be fed to the network. The model is regularized using dropout, batch normalization layers and L2 weight penalties.
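The shape arithmetic in this paragraph follows from the standard output-size formula for convolution and pooling. The sketch below checks it for a 256-pixel side and also spells out PReLU for a single value; the function names are illustrative, and the slope of 0.25 is an arbitrary example value, since in the real network it is learned.

```python
def conv_output_size(size, kernel=3, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size, window=2, stride=2):
    """Spatial output size of max pooling along one dimension."""
    return (size - window) // stride + 1


def prelu(x, slope):
    """PReLU: identity for positive inputs, slope * x for the rest."""
    return x if x > 0 else slope * x


unpadded = conv_output_size(256)            # 3x3, stride 1, no padding
padded = conv_output_size(256, padding=1)   # one pixel of zero padding
pooled = pool_output_size(padded)           # 2x2 max pooling, stride 2
print(unpadded, padded, pooled)
print(prelu(2.0, 0.25), prelu(-2.0, 0.25))
```

One pixel of zero padding on each side is exactly what compensates for the two neurons a 3x3 filter would otherwise remove.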
The model needed 300 epochs to converge to a local minimum using the Adam optimizer with a learning rate of 7.4e-05 and a batch size of 96 examples. During training the cross-entropy loss was minimized.
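For readers new to the loss, the cross-entropy for a single training example is simply the negative log of the probability the network assigns to the correct painter. The sketch below uses made-up probabilities; only the number of classes (1584) comes from the text.

```python
import math


def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_class])


# Made-up three-class distributions where class 0 is the true painter.
confident = cross_entropy([0.9, 0.05, 0.05], true_class=0)
unsure = cross_entropy([1 / 3, 1 / 3, 1 / 3], true_class=0)

# Loss of a uniform guess over the 1584 painters, a useful baseline when
# reading a training curve (about 7.37).
chance = cross_entropy([1 / 1584] * 1584, true_class=0)
print(confident < unsure < chance)
```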
Neural networks can be used as descriptor generators that produce lower-dimensional representations of input instances. One can think of them as automatic feature extractors. Such embeddings are obtained by simply taking the 2048-dimensional output vectors of the penultimate layer. To check whether there is any internal structure in the features produced by the ConvNet I’ve used the t-SNE dimensionality reduction technique. t-SNE is a convenient algorithm for visualizing high-dimensional data and allows us to compare how similar input instances are. Below are two scatter plots of some of the artwork images of randomly selected artists from the validation set. Bearing in mind that the network hasn’t seen those examples during training and that the t-SNE algorithm doesn’t get class labels as inputs, the visual results are quite exciting.
The public leaderboard score was calculated on 70% of the submission pairs and the private leaderboard score on the remaining 30%. The final submission was generated using the unsupervised approach for verifying the same class identity. The best single ConvNet scored 0.90717 AUC on the private leaderboard, and an ensemble of the 18 best ConvNets trained during the hyperparameter search process scored 0.92890 AUC on the private leaderboard. Adding more (worse) models to the ensemble started to hurt the overall performance. To combine the models, their predictions for the painter recognition task were first merged into a single class distribution per image as a weighted average, and only then was the inner product of the two averaged class distribution vectors calculated.
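The order of operations in the ensemble (average first, inner product second) can be sketched as follows. The two models, their outputs and the weights are invented for this example; the interview does not say how the actual weights were chosen.

```python
def average_distributions(model_probs, weights):
    """Weighted average of several models' class distributions for one image."""
    total = sum(weights)
    n_classes = len(model_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, model_probs)) / total
        for c in range(n_classes)
    ]


def inner_product(p, q):
    return sum(a * b for a, b in zip(p, q))


# Made-up outputs of two hypothetical models for two images, three painters.
image_a = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
image_b = [[0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]
weights = [2.0, 1.0]  # illustrative weights, not the author's

avg_a = average_distributions(image_a, weights)
avg_b = average_distributions(image_b, weights)
score = inner_product(avg_a, avg_b)
print(round(score, 4))
```

Because the weights are normalized by their sum, each averaged vector is still a valid probability distribution before the inner product is taken.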
Were you surprised by any of the findings?
The administrator of the competition, Kiri Nichol, has posted some very useful insights into the performance of the algorithm on the private test dataset. As stated on the competition forum, works by the ingenious Dutch forger Han van Meegeren were slipped into the test set in order to better understand how good the model is at extracting painters’ unique styles. The forger imitated the work of some of the world’s most famous artists, including the paintings of Johannes Vermeer. Below is a pairwise comparison table of my best submission’s predictions for van Meegeren and Vermeer examples from the test set. Based on the model’s predictions it can be seen that Vermeer’s paintings are indeed more similar to each other than van Meegeren’s paintings are to Vermeer’s paintings. It can also be seen that Vermeer’s paintings are more similar to each other than van Meegeren’s paintings are to each other, which is due to van Meegeren forging paintings in the style of several different artists.
Another really valuable insight concerns the extrapolation of the model to artists that were not seen during training. The results are given as the AUC of my final submission for two different groups of instances from the test set. The first group consists of pairs of images whose painters were present in the training set (0.94218 AUC), and the second one is composed of pairs whose artists haven’t been seen by the model before (0.82509 AUC).
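For context, AUC can be read as the probability that a randomly chosen same-painter pair receives a higher score than a randomly chosen different-painter pair. The sketch below computes it directly from that definition on made-up data; it is quadratic in the number of pairs and meant only for illustration, not for 22M submission pairs.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a positive pair outscores a negative pair.

    Ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up pairs: label 1 = same painter, 0 = different painters.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.5, 0.8, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

Here eight of the nine positive/negative combinations are ranked correctly, so the AUC is 8/9. Splitting the pairs into groups, as in the paragraph above, just means calling such a function once per group.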
Based on the results of the competition it can be concluded that convolutional neural networks are able to decompose the visual space of artwork images based on their painters’ unique styles. The bad news is that the described algorithm is not good at extrapolating to unfamiliar artists. This is largely due to the fact that same-identity verification is calculated directly from the two class distribution vectors, which only cover painters seen during training.
Which tools did you use?
All of the code was written in Python and the most important libraries that were used are Keras (with Theano backend), NumPy and scikit-learn.
What was the run time for both training and prediction of your winning solution?
Training of the final ConvNet took a bit more than 4 days on a single GeForce GTX TITAN X GPU. Predicting the artists for the 23817 test images took around 15 minutes, and the time needed for calculating the inner products for the 22M submission pairs was negligible compared to the training time.
Words of wisdom
What have you taken away from this competition?
As my first Kaggle competition this was an excellent learning experience, and since I’m planning to continue the work as my upcoming master’s degree thesis, it was also a great opportunity for me to gain more knowledge about possible pitfalls and challenges in the domain. From this point forward my main focus will be on achieving better generalization by training the end-to-end metric learning technique called a siamese network, which was only briefly mentioned above.
At this point I would also like to thank Niko Colnerič, Tomaž Hočevar, Blaž Zupan, Jure Žbontar and other members of the Bioinformatics Laboratory at the University of Ljubljana for their help and provision of the infrastructure.
Do you have any advice for those just getting started in data science?
I think that in order to really understand how something works, one has to implement it. This is especially important at the beginning, since one has no related knowledge yet to connect new ideas to. So start by implementing simple algorithms and use those to create submissions for Getting Started competitions.
Just for fun
What is your dream job?
Working as a data scientist in a diverse domain and with people from whom I can learn a lot.
Bio
Nejc Ilenič is currently an MSc student in computer science at the University of Ljubljana, Slovenia. After graduation he aspires to pursue a career as a data scientist.