The Avito Duplicate Ads Detection competition ran on Kaggle from May to July 2016 and attracted 548 teams with 626 players. In this challenge, Kagglers sifted through classified ads to identify which pairs of ads were duplicates intended to vex hopeful buyers. This competition, which saw over 8,000 submissions, invited unique strategies given its mix of Russian language textual data paired with 10 million images. In this interview, team ADAD describes their winning approach which relied on feature engineering including an assortment of similarity metrics applied to both images and text.
The basics
What was your background prior to entering this challenge?
Mario Filho: My background in machine learning is completely “self-taught”. I found a wealth of educational materials available online through MOOCs, academic papers and lectures. Since February 2014 I have worked as a machine learning consultant on projects for small startups and Fortune 500 companies.
Gerard Toonstra: I worked as a scientific developer at Thales for 3 years, which introduced me to more scientific development methods and algorithms. Most of my specific ML knowledge was acquired through courses on Coursera and just getting my hands dirty in Kaggle competitions and forum interactions.
Kele Xu: I am a PhD student whose thesis topic is the "silent speech interface".
Praveen Adepu: Academically, I have a Bachelor of Technology, and I work as a full-stack BI Technical Architect/Consultant.
Gilberto Titericz: I graduated in electronic engineering and hold an M.S. in wireless communications. In 2011 I started to learn data science by myself, and after joining Kaggle I started to learn even more.
How did you get started competing on Kaggle?
Mario Filho: I heard about Kaggle when I was taking my first courses about data science, and after I learned more about it I decided to try some competitions.
Gerard Toonstra: I was active in the Netflix grand prize quite a while ago, and at the end it pointed to the Kaggle site as another place to get your hands dirty. I ignored that until July last year, when I decided to start on the Avito click challenge. It's pretty cool that exactly one year later, after some avid Kaggling, I'm part of the 3rd place submission.
Kele Xu: I participated in KDD Cup 2015 and finished 40th there. That was my first competition. After KDD Cup 2015, I became a Kaggler, and I have learned a lot during the last year.
Praveen Adepu: I am new to machine learning, R and Python. I like learning by doing, and I realised Kaggle is the best fit for that kind of learning through competitions. Initially I experimented with a few competitions to find learning patterns, then started working seriously about 6 months ago. I have learnt a lot in those 6 months and look forward to learning more from Kaggle.
Gilberto Titericz: After the Google AI Challenge 2011 I was searching the internet for another online competition platform and found Kaggle.
What made you decide to enter this competition?
Mario Filho: At the time it was the only competition that had a reasonable dataset size and was not too focused on images. So I thought it would be possible to get a good result with feature engineering and the usual tools.
Gerard Toonstra: I feel that my software engineering background can give me an edge in certain competitions. The huge amount of data requires a bit of planning and modularizing of the code. I started doing that in the Dato competition, did it a bit better for Home Depot, and in Avito I started some more serious pipelining and feature engineering. It's not that the pipelines are sophisticated; their only purpose is to reduce feature-building time. Instead of waiting for one script to finish in 8 hours, I just build features in parts and glue/combine them together.
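As a rough sketch of that part-by-part approach: each feature block is written to disk by its own script and the parts are joined on the ad-pair key before training. The file names and feature columns below are hypothetical; the pair key columns are assumed to follow the competition's pairs files.

```python
import pandas as pd

# Each feature block is produced by its own script and written to disk,
# keyed by the ad pair (hypothetical file names and columns).
text_feats = pd.read_csv("features_text.csv")         # itemID_1, itemID_2, tfidf_sim, ...
image_feats = pd.read_csv("features_image_hash.csv")  # itemID_1, itemID_2, phash_min, ...

# Glue the parts together on the pair key instead of recomputing everything
# in one monolithic long-running script.
features = text_feats.merge(image_feats, on=["itemID_1", "itemID_2"], how="left")
features.to_csv("features_combined.csv", index=False)
```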
Kele Xu: When I decide to enter a new competition, I like to select a topic I have no prior experience with. That way, I take more away from the competition. In fact, before this competition I had little experience with NLP tasks. That's the main reason I entered this one.
Praveen Adepu:
- I like feature engineering, and this competition required a lot of hand-crafted feature engineering
- The very high LB benchmark score attracted me, as a way to test the skills I learned in previous competitions.
I was still left with a lot of feature engineering ideas even after passing the benchmark in a couple of weeks, so I planned to spend a bit more time on this competition.
Gilberto Titericz: Lately I have interest in competitions involving image processing and deep learning.
Let's Get Technical
What preprocessing and supervised learning methods did you use?
Mario Filho: I stemmed and cleaned the text fields, then used tf-idf to compute similarities between them. I used the hash similarity script available in the forums to compute the similarity between images. After creating and testing lots of features, I used XGBoost to train a model.
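As an illustration of the tf-idf similarity features Mario describes, here is a minimal sketch with scikit-learn; the toy titles and column names are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# One row per ad pair, with the (already cleaned/stemmed) title of each ad.
pairs = pd.DataFrame({
    "title_1": ["детский велосипед продам", "чехол для iphone 5s"],
    "title_2": ["продам велосипед детский", "зарядка для samsung"],
})

# Fit one vocabulary over both sides so the paired vectors share the same space.
vectorizer = TfidfVectorizer()
vectorizer.fit(pd.concat([pairs["title_1"], pairs["title_2"]]))

# L2-normalized tf-idf vectors: the row-wise dot product is the cosine similarity.
v1 = normalize(vectorizer.transform(pairs["title_1"]))
v2 = normalize(vectorizer.transform(pairs["title_2"]))
pairs["title_tfidf_sim"] = np.asarray(v1.multiply(v2).sum(axis=1)).ravel()

print(pairs[["title_tfidf_sim"]])  # one similarity feature per ad pair
```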
Gerard Toonstra: I had one feature array for textual features, which is essentially pretty much what others did in the forum. Then I built another feature set with additional text features, a minhash similarity feature, and tf-idf + SVD over the description, and then I worked on image hashes: ahash, phash and dhash with min, max and avg features, plus the count of zero distances divided by the maximum number of images. I kept focusing on making these normalized distance metrics, dividing by 'length' or by the number of images where appropriate. As things advanced, I also added image histograms on RGB, histograms on HSV and the Earth Mover's Distance between images.
I ended up only using four model types: logistic regression over tf-idf features over cleaned text, XGB, Random Forests and FTRL.
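A rough sketch of the aggregated image-hash distances Gerard describes: hash every image of each ad, take all pairwise Hamming distances across the two image sets, and reduce them to min/max/avg features plus the zero-distance count normalized by the larger image count. It uses the Pillow and imagehash packages; the helper names and the handling of empty image lists are assumptions.

```python
import itertools
import numpy as np
import imagehash
from PIL import Image

def hash_images(paths):
    # One perceptual hash per image belonging to an ad.
    return [imagehash.phash(Image.open(p)) for p in paths]

def pair_hash_features(paths_1, paths_2):
    """Min/max/avg Hamming distance between the two ads' image sets."""
    h1, h2 = hash_images(paths_1), hash_images(paths_2)
    if not h1 or not h2:
        return {"hash_min": np.nan, "hash_max": np.nan,
                "hash_avg": np.nan, "hash_zero_frac": np.nan}
    # Subtracting two ImageHash objects gives their Hamming distance.
    dists = [a - b for a, b in itertools.product(h1, h2)]
    return {
        "hash_min": min(dists),
        "hash_max": max(dists),
        "hash_avg": float(np.mean(dists)),
        # Count of exact hash matches, normalized by the larger number of images.
        "hash_zero_frac": sum(d == 0 for d in dists) / max(len(h1), len(h2)),
    }
```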
Kele Xu: Here I used only XGBoost. XGBoost is really competitive on this problem. My best single XGBoost model could reach the top 14 on both the public and private leaderboards.
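For readers new to it, a minimal XGBoost setup of the kind Kele refers to might look like the sketch below; the synthetic data stands in for the engineered pair features, and the hyperparameters are illustrative rather than the ones he used.

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the per-pair similarity features and duplicate labels.
X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=8,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
)
model.fit(X_train, y_train)

# AUC is a reasonable metric to track locally for a binary duplicate target.
print("validation AUC:", roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]))
```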
Praveen Adepu: No special pre-processing methods; I just followed feature engineering best practices and kept an eye on processing times when creating new features.
I used XGBoost, h2o Random Forest, h2o deep learning, Extra Trees and Regularised Greedy Forest.
Gilberto Titericz: Most preprocessing was done on text features, such as stop word removal and stemming. I also built a very interesting feature using deep learning: I used the pre-trained MXNet Inception-21k model to predict one class for each of the 10M images in the dataset. Measuring some statistics between those classes helped improve our models.
The supervised learning methods we used are based on gradient boosting (XGBoost), RandomForest, ExtraTrees, FTRL, Linear Regression, Keras and MXNet. We also used the libffm and RGF algorithms, but dropped them in the end.
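The Inception-21k feature Gilberto mentions can be sketched as: predict one class per image with the pre-trained model, then turn agreement between the two ads' predicted classes into features. The `predict_top_class` helper and the specific statistics below (Jaccard overlap, any-match flag) are stand-ins; the interview does not spell out which statistics the team used.

```python
def predict_top_class(image_path):
    """Hypothetical stand-in: in the team's solution this was the top-1 class
    predicted by the pre-trained MXNet Inception-21k model for the image."""
    raise NotImplementedError

def class_agreement_features(paths_1, paths_2):
    """Simple statistics comparing the predicted classes of two ads' image sets."""
    c1 = {predict_top_class(p) for p in paths_1}
    c2 = {predict_top_class(p) for p in paths_2}
    if not c1 or not c2:
        return {"class_jaccard": float("nan"), "class_any_match": float("nan")}
    inter, union = len(c1 & c2), len(c1 | c2)
    return {
        "class_jaccard": inter / union,       # overlap between the two sets of predicted classes
        "class_any_match": float(inter > 0),  # do the ads share at least one predicted class?
    }
```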
What have you taken away from this competition?
Mario Filho: I learned new techniques for calculating image similarity based on hashes, and 2 or 3 Russian words.
Gerard Toonstra: The benefit of working in teams and learning from that experience. I really wanted to learn about stacking. There is a good description on mlwave.com, but when you get around to actually doing that in practice, there are still questions that pop up. The other observation is that as a competition draws to a close, it's all hands on deck and you have to put the last effort in, especially the last week. I did not have enough hardware resources and used in the most extreme case three 16-cpu machines spread across three geographical zones on Google Cloud. Despite the hardware challenges and exploration we had to do, I noticed some solo competitors who were able to gain ground quickly and rocket up. I really respect that. It tells me that those guys have acquired so much experience that they consistently make right decisions on what to spend time on and I still have a long way to go.
Kele Xu: I have learned a lot from my teammates, especially about hyper-parameter optimization for the XGBoost model and ensemble techniques. I also got to know how to do some NLP tasks.
Praveen Adepu: I think joining the team at the right time, and with the right team members, was the best decision I made in this competition.
I started with two goals while joining the team:
- Learn from the team while contributing to the best of my knowledge
- Learn ensembling/stacking
I would like to take this opportunity to say thank you to my entire team: Gilberto, Mario, Gerard and Kele.
Gilberto - thank you for teaching me many, many concepts, including stacking
Mario - thank you for guiding me while experimenting with many models
Gerard - I will never forget our initial discussion on feature engineering
Kele - thank you for the invite to join the team and making this happen
I learnt more from this team in one month than alone in the past 6 months, and I look forward to working with them in the near future.
Gilberto Titericz: I learned a lot about text and image distance metrics.
Do you have any advice for those just getting started in data science?
Mario Filho: If you plan to apply machine learning, try to understand the fundamentals of the models and concepts like validation, bias and variance. Really understand these concepts and try to apply them to data.
Gerard Toonstra: The first thing is to understand cross validation score and its specific relation to the leaderboard in that competition and how to establish a sound folding strategy in general. After that, I recommend establishing a very simple model based on features that do not overfit and evaluate the behavior. Then add a couple of features one by one that are unlikely to overfit and keep evaluating. Record the difference between CV and LB when you make submissions. As you establish confidence in your local CV, start working out more features and do a couple of in-between checks to ensure you're not overfitting locally; especially depending on how some aggregation features are built, they can have overfitting effects.
When you get started, don't allow yourself to get bogged down by endless hours of parametrizing and hyperparameterizing the models. You may get the extra +0.00050 to jump up 3 positions, but it's not what most on the LB are doing. Figure out how to create new informative features, which features should be avoided and when you grow tired of that, only then spend some time optimizing what you have.
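A minimal sketch of the CV bookkeeping Gerard recommends: score every experiment on the same fixed folds and record the local score next to the leaderboard score of the matching submission. The model and synthetic data here are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Fixed folds so every experiment is evaluated on exactly the same split.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(len(y))
for train_idx, valid_idx in folds.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]

cv_score = roc_auc_score(y, oof)

# Log local CV next to the leaderboard score of the corresponding submission,
# so the CV-LB gap can be tracked over time (the LB value is filled in after submitting).
submission_log = [{"submission": "baseline_text_features", "cv_auc": cv_score, "lb_auc": None}]
print(submission_log)
```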
Kele Xu: I think Kaggle is quite a good platform both for experienced data scientists and for those who are just getting started, as it gives newcomers a place to test how each algorithm performs.
As for technical suggestions, I would say a solid local CV is the key to winning a competition. Usually, I spend 2-4 weeks testing my local CV.
Feature engineering is more important than hyper-parameter optimization.
Although XGBoost is enough to get a relatively good rank in some competitions, we should also test more models to add diversity to the ensemble, not just select different seeds or set different max depths for tree-based methods.
The last thing is: just test your ideas and get feedback from the LB. When you reach a higher rank in one competition, you will have the motivation to do even better.
Praveen Adepu: I would call these my experience rather than advice:
- Start with clear objectives: learn the basics slowly and progressively, pick up advanced concepts quickly, and end with mastery
- Never be afraid to start and fail
- Get a clear understanding of local CV, underfitting and overfitting
- Start learning from feature engineering and end with stacking
Gilberto Titericz: Read a lot of related material. Search for problems to solve. Read about problems' solutions. And make your fingers work, program and learn from your mistakes.
Bios
Mario Filho is a machine learning consultant focused on helping companies around the world use machine learning to maximize the value they get from data and achieve their business goals. Besides that, he mentors individuals who want to learn how to apply machine learning algorithms to real-world data sets.
Gerard Toonstra graduated as a nautical officer and engineer, but has mostly worked as a software engineer and started his own company in Brazil working with drones for surveying. He now works as a scrum master for the BI department at Coolblue in the Netherlands.
Kele Xu is a PhD student writing his thesis at the Langevin Institute (University of Pierre and Marie Curie). His main interests include silent speech recognition, machine learning and computer vision.
Praveen Adepu is currently working as a BI Technical Architect/Consultant at Fred IT, a Melbourne-based IT product company. His main interests are machine learning and data architecture.
Gilberto Titericz is an electronics engineer with an M.S. in telecommunications. For the past 16 years he has worked as an engineer for big multinationals like Siemens and Nokia, and later as an automation engineer for Petrobras Brazil. His main interests are machine learning and electronics.
Kernel Corner
Getting hashes from images was an important strategy in detecting similarities across the 10 million images in this competition as Gerard Toonstra explains:
phash calculates a hash in a special way such that it reduces the dimensionality of the source image into a 64-bit number. It captures the structure of the image. When you then subtract two hashes, you get a number that resembles the 'structural distance' between the two images. The higher the number, the greater the distance between the two.
Kaggler Run2 shared code on Kaggle Kernels which allowed Kagglers to incorporate distance metrics from images without performing heavy duty image processing or deep learning.
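As a minimal sketch of that approach (not Run2's exact kernel), using the Pillow and imagehash packages with placeholder file names:

```python
import imagehash
from PIL import Image

# A perceptual hash reduces each image to a 64-bit fingerprint of its structure.
hash_a = imagehash.phash(Image.open("ad1_image.jpg"))
hash_b = imagehash.phash(Image.open("ad2_image.jpg"))

# Subtracting two ImageHash objects returns the Hamming distance between them:
# the number of differing bits, i.e. the 'structural distance' described above.
distance = hash_a - hash_b
print(distance)  # 0 means (near-)identical structure; larger means more different
```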