The Avito Duplicate Ads Detection competition ran from May to July 2016. A feature engineer's dream, this competition challenged Kagglers to accurately detect duplicitous duplicate ads in a dataset of 10 million images and Russian-language text. In this winners' interview, Stanislav Semenov and Dmitrii Tsybulevskii describe how their single XGBoost model scored among the top three and how their ensemble snagged them first place. Stanislav's third Avito competition was a special one, too; his first-place win as part of Devil Team boosted him to #1 Kaggler status!
The basics:
What was your background prior to entering this challenge?
Dmitrii Tsybulevskii: I hold a degree in Applied Mathematics, and I’ve worked as a software engineer on computer vision and machine learning projects.
Stanislav Semenov: I hold a Master's degree in Computer Science. I've worked as a data science consultant, teacher of machine learning classes, and quantitative researcher.
Do you have any prior experience or domain knowledge that helped you succeed in this competition?
Dmitrii Tsybulevskii: Yes, I’ve worked on image duplicate detection and text classification problems before, and I know the Russian language.
Stanislav Semenov: This is my 3rd Avito competition on Kaggle! And yes, I know the Russian language, too.
What made you decide to enter this competition?
Dmitrii Tsybulevskii: A lot of raw data, both text and images - a large field for feature engineering, and I like feature engineering.
Stanislav Semenov: A large area for feature engineering.
Let’s get technical:
What preprocessing and supervised learning methods did you use?
It was all about feature engineering, so we tried to generate as many strong features as we could. XGBoost was the only learning method used; our single XGBoost model alone could reach the top three! Our final model simply averaged XGBoost models trained with different random seeds.
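The seed-averaging the winners describe can be sketched in a few lines. This is a hedged, self-contained illustration, not their pipeline: scikit-learn's GradientBoostingClassifier stands in for XGBoost, and the synthetic dataset and hyperparameters are assumptions chosen only to make the example runnable.

```python
# Sketch: average predictions of boosted models trained with different seeds.
# GradientBoostingClassifier substitutes for XGBoost; data is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

preds = []
for seed in (0, 1, 2):
    model = GradientBoostingClassifier(n_estimators=50, random_state=seed)
    model.fit(X, y)
    preds.append(model.predict_proba(X)[:, 1])  # P(duplicate) per pair

# Final prediction: simple mean over the per-seed probability vectors.
avg_pred = np.mean(preds, axis=0)
```

Averaging over seeds reduces the variance each individual boosted model picks up from its random subsampling, at essentially zero engineering cost.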
We used the following text preprocessing:
- stemming
- lemmatization
- transliteration
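For Russian text, stemming is commonly done with NLTK's SnowballStemmer("russian") and lemmatization with a morphological analyzer such as pymorphy2; the winners don't name their libraries, so those are assumptions. The transliteration step can be as simple as a character map. A minimal sketch (the exact transliteration table is an assumption, since no scheme is specified):

```python
# Minimal Cyrillic-to-Latin transliteration via a character map.
# The table below is an illustrative assumption, not the winners' scheme.
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}

def transliterate(text: str) -> str:
    """Map each Cyrillic character to a Latin approximation;
    characters not in the table pass through unchanged."""
    return "".join(TRANSLIT.get(ch, ch) for ch in text.lower())
```

Transliteration makes ads written in mixed Cyrillic/Latin spellings comparable before computing string-similarity features.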
Our features:
- different similarity features between title-title, title-description, and title-JSON pairs, such as cosine distance, Levenshtein distance, Jaccard similarity, NCD, etc.
- different features based on exact matches of words in the title and description
- general features such as prices, places, number of images, exact matches of the title and description, etc.
- different similarity features from trained word2vec models
- LSI features on the union and the XOR of the ad pair's texts
- one-hot encoding of categoryID
- ratios of the title, description, and JSON lengths
- distances between BRIEF image descriptors
- distances between color histograms in LAB space and between HOG histograms
- distances between features extracted with the pretrained MXNet BN-Inception-21k network, plus the first PCA components of these features
- number of matches computed with the AKAZE local visual feature detector and descriptor
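Several of the text distances listed above can be computed with only the standard library. A hedged sketch (the winners' exact implementations and tokenization are not published):

```python
import zlib

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def levenshtein(s: str, t: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, with zlib as the compressor."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Each function yields one scalar per ad pair (e.g. `jaccard(set(title1.split()), set(title2.split()))`), so the same helpers cover the title-title, title-description, and title-JSON combinations.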
The most important trick was to submit our best result two hours before the end of the competition. That was EXTREMELY fun! =)
Did knowing Russian help you in this competition? If so, how?
Stanislav Semenov: Not so much. Of course, you can see where your model is wrong and take a close look at those ads, but it did not give any new information.
Dmitrii Tsybulevskii: On the one hand, it was comfortable to work with Russian texts, because you know what the ads are about. On the other hand, we had no killer features based on it.
Which tools did you use?
Jupyter Notebook, XGBoost, Pandas, scikit-learn, VLFeat, OpenCV, MXNet
What was the run time for your winning solution?
Feature extraction: 3-4 days
Model training: 1-2 weeks
Words of wisdom:
What have you taken away from this competition?
Dmitrii Tsybulevskii: I have learned about NCD (normalized compression distance) and some convenient things about team cooperation.
Stanislav Semenov: A lot of fun and much needed ranking points.
Do you have any advice for those just getting started in data science?
Stanislav Semenov: Solving practical problems is your best friend.
Dmitrii Tsybulevskii: Kaggle is a great platform for getting new knowledge.
Bio
Stanislav Semenov is a Data Scientist and Quantitative Researcher.
Dmitrii Tsybulevskii is a software engineer. He holds a degree in Applied Mathematics. His main interests are computer vision and machine learning.