The Avito Duplicate Ads Detection competition ran from May to July 2016. A feature engineer's dream, this competition challenged Kagglers to accurately detect duplicitous duplicate ads in a dataset of 10 million images and Russian-language text. In this winners' interview, Stanislav Semenov and Dmitrii Tsybulevskii describe how their single XGBoost model scored among the top three and how their ensemble snagged them first place. Stanislav's third Avito competition was a special one, too; his first-place win as part of Devil Team boosted him to #1 Kaggler status!
The basics:
What was your background prior to entering this challenge?
Dmitrii Tsybulevskii: I hold a degree in Applied Mathematics, and I’ve worked as a software engineer on computer vision and machine learning projects.
Stanislav Semenov: I hold a Master's degree in Computer Science. I've worked as a data science consultant, teacher of machine learning classes, and quantitative researcher.
Do you have any prior experience or domain knowledge that helped you succeed in this competition?
Dmitrii Tsybulevskii: Yes, I’ve worked on image duplicate detection and text classification problems before, and I know the Russian language.
Stanislav Semenov: This is my 3rd Avito competition on Kaggle! And yes, I know the Russian language, too.
What made you decide to enter this competition?
Dmitrii Tsybulevskii: A lot of raw data, both text and images - a large field for feature engineering, and I like feature engineering.
Stanislav Semenov: A large area for feature engineering.
Let’s get technical:
What preprocessing and supervised learning methods did you use?
It was all about feature engineering, so we tried to generate as many strong features as we could. XGBoost was the only learning method used; our single XGBoost model alone could reach the top three! Our final model simply averaged XGBoost models trained with different random seeds.
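The seed-averaging the winners describe can be sketched in a few lines. This is a hedged, self-contained illustration, not their pipeline: scikit-learn's GradientBoostingClassifier stands in for XGBoost, and the synthetic dataset and hyperparameters are assumptions chosen only to make the example runnable.

```python
# Sketch: average predictions of boosted models trained with different seeds.
# GradientBoostingClassifier substitutes for XGBoost; data is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

preds = []
for seed in (0, 1, 2):
    model = GradientBoostingClassifier(n_estimators=50, random_state=seed)
    model.fit(X, y)
    preds.append(model.predict_proba(X)[:, 1])  # P(duplicate) per pair

# Final prediction: simple mean over the per-seed probability vectors.
avg_pred = np.mean(preds, axis=0)
```

Averaging over seeds reduces the variance each individual boosted model picks up from its random subsampling, at essentially zero engineering cost.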
We used the following text preprocessing:
- stemming
- lemmatization
- transliteration
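For Russian text, stemming is commonly done with NLTK's SnowballStemmer("russian") and lemmatization with a morphological analyzer such as pymorphy2; the winners don't name their libraries, so those are assumptions. The transliteration step can be as simple as a character map. A minimal sketch (the exact transliteration table is an assumption, since no scheme is specified):

```python
# Minimal Cyrillic-to-Latin transliteration via a character map.
# The table below is an illustrative assumption, not the winners' scheme.
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}

def transliterate(text: str) -> str:
    """Map each Cyrillic character to a Latin approximation;
    characters not in the table pass through unchanged."""
    return "".join(TRANSLIT.get(ch, ch) for ch in text.lower())
```

Transliteration makes ads written in mixed Cyrillic/Latin spellings comparable before computing string-similarity features.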
Our features:
- different similarity features between title-title, title-description, and title-JSON pairs, such as cosine distance, Levenshtein distance, Jaccard similarity, NCD, etc.
- different features based on exact matches of words in the title and description
- general features such as prices, places, number of images, exact matches of the title and description, etc.
- different similarity features from trained word2vec models
- LSI features on the union and the XOR of the ad pair's texts
- one-hot encoding of categoryID
- ratios of the title, description, and JSON lengths
- distances between BRIEF image descriptors
- distances between color histograms in LAB space and between HOG histograms
- distances between features extracted with the pretrained MXNet BN-Inception-21k network, plus the first PCA components of these features
- number of matches computed with the AKAZE local visual feature detector and descriptor
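Several of the text distances listed above can be computed with only the standard library. A hedged sketch (the winners' exact implementations and tokenization are not published):

```python
import zlib

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def levenshtein(s: str, t: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, with zlib as the compressor."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Each function yields one scalar per ad pair (e.g. `jaccard(set(title1.split()), set(title2.split()))`), so the same helpers cover the title-title, title-description, and title-JSON combinations.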
The most important trick was to submit our best result two hours before the end of the competition. That was EXTREMELY fun! =)
Did knowing Russian help you in this competition? If so, how?
Stanislav Semenov: Not so much. Of course, you can see where your model is wrong and take a close look at those ads, but it did not give any new information.
Dmitrii Tsybulevskii: On the one hand, it was comfortable to work with Russian texts, because you know what the ads are about. On the other hand, we had no killer features based on it.
Which tools did you use?
Jupyter Notebook, XGBoost, Pandas, scikit-learn, VLFeat, OpenCV, MXNet
What was the run time for your winning solution?
Feature extraction: 3-4 days
Model training: 1-2 weeks
Words of wisdom:
What have you taken away from this competition?
Dmitrii Tsybulevskii: I have learned about NCD (normalized compression distance) and some convenient things about team cooperation.
Stanislav Semenov: A lot of fun and much needed ranking points.
Do you have any advice for those just getting started in data science?
Stanislav Semenov: Solving practical problems is your best friend.
Dmitrii Tsybulevskii: Kaggle is a great platform for getting new knowledge.
Bio
Stanislav Semenov is a Data Scientist and Quantitative Researcher.
Dmitrii Tsybulevskii is a software engineer. He holds a degree in Applied Mathematics. His main interests are computer vision and machine learning.