The House Prices playground competition originally ran on Kaggle from August 2016 to February 2017. During this time, over 2,000 competitors experimented with advanced regression techniques like XGBoost to accurately predict a home’s sale price based on 79 features. In this blog post, we feature authors of kernels recognized for their excellence in data exploration, feature engineering, and more.
In this set of mini-interviews, you’ll learn:
- how writing out your process is an excellent way to learn new algorithms like XGBoost;
- if your goal is to learn, sharing your approach and getting feedback may be more motivating than reaching for a top spot on the leaderboard; and
- just how easy it is to fall into the trap of overfitting your data, plus how to visualize and recognize it when it happens.
Read on or click the links below to jump to a section.
We’ve also renewed the challenge as a new Getting Started Competition, so we encourage you to fork any of these kernels or try out something completely new to expand your machine learning skill set.
Data Exploration/XGBoost
Fun with Real Estate Data
Created by: Stephanie Kirmer
Language: R
What motivated you to create it?
I had just learned about XGBoost, and was interested in doing a start-to-finish project comparing xgb to regression and random forest side by side. I thought others might also want to see how the procedures compared to each other. I learn a lot by doing; I’m a hands-on learner as far as code is concerned, so this was a great opportunity for me.
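If you want to try that kind of side-by-side comparison yourself, here is a minimal sketch in Python (Stephanie’s kernel itself is in R, so this is not her code) that cross-validates a linear model, a random forest, and XGBoost on the competition’s train.csv. The hyperparameters are placeholder values rather than tuned choices, and the xgboost package is assumed to be installed.

```python
# Minimal sketch: the same cross-validation applied to a linear model,
# a random forest, and XGBoost on the House Prices training data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

train = pd.read_csv("train.csv")
y = np.log1p(train["SalePrice"])          # the competition scores log prices
X = pd.get_dummies(train.drop(columns=["Id", "SalePrice"])).fillna(0)

models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=300, random_state=0),
    "xgboost": XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=0),
}
for name, model in models.items():
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error")
    print(f"{name}: {rmse.mean():.4f} +/- {rmse.std():.4f}")
```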
What did you learn from your analysis?
Oh goodness, I learned a lot. I have since learned even more, and my xgb implementations today are better, I think, but with this one I learned about setting up the script so it would be smooth and make sense to the reader, and so it would run at a reasonable speed (not that easy; random forests are slow). I also put some real work into the feature engineering, and simply learned a lot about how houses are classified and measured.
Can you tell us about your approach in the competition?
Entering the competition with my results was kind of an afterthought. I started work early on in this competition because it was a dataset I could actually work with in the kernels environment or on my local machine (most competition datasets on Kaggle are way too big for my hardware/software to manage). I wanted to write the kernel first, and then it was just easy to enter from that interface, so I did. I’m pretty proud of the results, given it was my first reasonably competent implementation of XGBoost!
Comprehensive Data Exploration with Python
Created by: Pedro Marcelino
Language: Python
What motivated you to create it?
My main motivation was learning. Currently, I am looking to develop my skills as a data scientist, and I found that Kaggle is the best place to do it. In my experience, any learning process gets easier if you can relate your study subject to something that you already know. On Kaggle you can do that, because you can always find a dataset to fall in love with. That is what happened in my case. Having a background in Civil Engineering, the ‘House Prices: Advanced Regression Techniques’ competition was an obvious choice, since predicting house prices was a problem I had already thought about. In that sense, Kaggle works great for me, and it is the place I go to when I want to learn data science topics.
What did you learn from your analysis?
The most important lesson I took from my analysis was that documenting your work is a great advantage. There are many reasons why I believe this. First, writing helps clarify thinking, and that is essential in any problem-solving task. Second, when everything is well documented, it is easier to reuse your work for future reference in related projects. Third, if you document your work, you will improve the quality of the feedback you receive and, consequently, get more chances to improve. In the beginning, it might feel frustratingly slow to document everything, but you will get faster with practice. In the end, you will realize that it can be a fun exercise (and even feel compelled to add some jokes to your text).
Can you tell us about your approach in the competition?
My approach was to focus on a specific aspect of the data science process and look for a solid bibliographic reference that could guide my work. For this competition, I opted to improve my skills in data exploration. As a bibliographic reference, I used the book ‘Multivariate Data Analysis’ (Hair et al., 2014), in particular its Chapter 3, ‘Examining your data’. Since the book is well organized and written in a straightforward way, it is easy to follow and to use as a bridge between theory and practice. This is the approach I usually follow when I am learning the basics: define the problem I want to solve, look for related references, and adapt them to my needs. Nothing more than ‘standing on the shoulders of giants’.
Pre-Processing and Feature Engineering
A Study on Regression Applied to the Ames Dataset
Created by: Julien Cohen Solal
Language: Python
What motivated you to create it?
Playground competitions are all about learning and sharing. I’m no expert at all, and was even less so when I published this kernel, but most of what I learned about machine learning models, I learned on Kaggle. If I recall correctly, I published it pretty early in the competition. There were already a few really interesting kernels, but I felt some of the work I had done hadn’t been presented anywhere else thus far, so that was my opportunity to share.
Around that time, I had also just finished reading a book I really liked (Python Machine Learning by Sebastian Raschka), and I couldn’t wait to apply some of the things I had read about to a dataset that looked interesting to me. When my first few submissions scored pretty decent results (at that moment, at least), I figured my code was probably good enough that a few people could learn a thing or two reading it, and that I could also maybe get some feedback to improve it.
What did you learn from your analysis?
Well, strictly speaking about the topic of house prices, it confirmed what was pretty much universally known: it’s all about location and size. Other features like overall quality matter as well, but much, much less.
Now, about applying machine learning to real-world problems, this was a real learning experience for me. First of all, feature engineering is fun! It’s definitely my favorite part of the overall process, the creative aspect of it especially, and a dataset like this one, where features aren’t anonymized, lets you really focus on improving your dataset with new features that make sense rather than blindly combining features and trying random transformations on them.
Also, applying regularization when using Linear Regression is pretty much essential. It penalizes extreme parameter weights and as such allows us to find a much better bias/variance tradeoff and to avoid overfitting.
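To make the regularization point concrete, here is a minimal sketch (not Julien’s actual code) that compares plain least squares against Ridge (an L2 penalty on the weights) and Lasso (an L1 penalty) using cross-validated RMSE on log prices. The alpha values are arbitrary placeholders you would normally tune, for example with RidgeCV or LassoCV.

```python
# Minimal sketch: unregularized vs. regularized linear models on the
# one-hot encoded House Prices features, scored with cross-validated RMSE.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")
y = np.log1p(train["SalePrice"])
X = pd.get_dummies(train.drop(columns=["Id", "SalePrice"])).fillna(0)

for name, model in [("plain OLS", LinearRegression()),
                    ("ridge", Ridge(alpha=10.0)),
                    ("lasso", Lasso(alpha=0.001, max_iter=50000))]:
    pipe = make_pipeline(StandardScaler(), model)   # penalties need scaled features
    rmse = -cross_val_score(pipe, X, y, cv=5,
                            scoring="neg_root_mean_squared_error")
    print(f"{name}: {rmse.mean():.4f}")
```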
Can you tell us about your approach in the competition?
Right from the start, I knew I wouldn’t try to aim for the top spots (I wouldn’t be able to anyway!). I’m not really interested in stacking tens or hundreds of finely-tuned models, which seems to be pretty much necessary to win any Kaggle competition these days. I was in it to test some techniques I had heard about, and learn some new ones via the forum or the kernels.
I tried to take the most interesting ideas I could find in the kernels that were already published, mix them with my own, and go from there. I was using cross-validation to validate every single preprocessing concept, but it was at times a frustrating process: the dataset is really small, and it was hard to distinguish the signal from the noise. Some features that made so much sense to me were actually hurting my score. All in all, it was still a great learning experience, and I’m happy I was able to share some knowledge as well.
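A minimal sketch of that cross-validation workflow (again, not Julien’s actual code): score a baseline model, then score it again after adding one candidate engineered feature, here a hypothetical combined square-footage column, and keep the feature only if the cross-validated RMSE actually improves.

```python
# Minimal sketch: validate a candidate engineered feature with cross-validation.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")
y = np.log1p(train["SalePrice"])

def cv_rmse(frame):
    """Cross-validated RMSE of a simple ridge model on one-hot encoded features."""
    X = pd.get_dummies(frame.drop(columns=["Id", "SalePrice"])).fillna(0)
    scores = cross_val_score(Ridge(alpha=10.0), X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()

baseline = cv_rmse(train)

# Hypothetical "feature that makes sense": total above- and below-ground area.
candidate = train.copy()
candidate["TotalSF"] = (candidate["TotalBsmtSF"]
                        + candidate["1stFlrSF"]
                        + candidate["2ndFlrSF"])

print(f"baseline CV RMSE:     {baseline:.4f}")
print(f"with TotalSF CV RMSE: {cv_rmse(candidate):.4f}")
```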
Insightful Visualizations
A Clear Example of Overfit
Created by: Osvaldo Zagordi
Language: R
What motivated you to create it?
What I liked about this competition was its small dataset, which allowed me to experiment quickly on my old laptop, and the fact that everybody can easily get a feel for what a house price is. Your RMSE is zero-point-something, but how does that translate into the predictions? Am I off by one thousand, ten thousand, or one hundred thousand dollars? Plotting the predicted vs. the actual prices was then a natural thing to do.
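For context, the competition’s score is RMSE on the logarithm of the sale price, so an RMSE of, say, 0.12 corresponds roughly to a 12% typical relative error rather than a fixed dollar amount. Here is a minimal Python sketch (Osvaldo’s kernel is in R) of the predicted-versus-actual plot he describes, using a random forest as an arbitrary stand-in model.

```python
# Minimal sketch: fit any model on a train split, then plot predicted vs.
# actual sale prices (back-transformed to dollars) on the held-out split.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")
y = np.log1p(train["SalePrice"])
X = pd.get_dummies(train.drop(columns=["Id", "SalePrice"])).fillna(0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
actual = np.expm1(y_te)                         # back from log scale to dollars
predicted = np.expm1(model.predict(X_te))

plt.scatter(actual, predicted, s=10, alpha=0.5)
lims = [actual.min(), actual.max()]
plt.plot(lims, lims, linestyle="--", color="red")   # perfect-prediction line
plt.xlabel("Actual sale price ($)")
plt.ylabel("Predicted sale price ($)")
plt.show()
```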
What did you learn from your analysis?
Trivially, I learned that overfitting can hit me much harder than I would have expected. The predictions of gradient boosted trees on the training set are extremely good, almost perfect! Of course, I was in a situation with slightly more than 50 predictors and 700 observations. Still, it was remarkable to observe the trees adapt so perfectly to the observations.
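Here is a minimal sketch of that observation, with scikit-learn’s GradientBoostingRegressor standing in for the kernel’s boosted trees: the training-set RMSE comes out deceptively close to zero, while cross-validation gives a much larger, more honest estimate.

```python
# Minimal sketch: training-set RMSE vs. cross-validated RMSE for boosted trees.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")
y = np.log1p(train["SalePrice"])
X = pd.get_dummies(train.drop(columns=["Id", "SalePrice"])).fillna(0)

model = GradientBoostingRegressor(n_estimators=1000, max_depth=4, random_state=0)
model.fit(X, y)

train_rmse = np.sqrt(mean_squared_error(y, model.predict(X)))
cv_rmse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_root_mean_squared_error").mean()
print(f"training-set RMSE:    {train_rmse:.4f}")   # deceptively close to zero
print(f"cross-validated RMSE: {cv_rmse:.4f}")      # the number that matters
```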
Can you tell us about your approach in the competition?
Competing at a high level on Kaggle quickly becomes hard. I was experimenting with a method aimed precisely at avoiding overfitting when I made that observation. In the end I did not write a kernel on that technique (maybe next time), and I did not invest much energy in trying to climb the leaderboard, but I decided it was worth writing a kernel showing the example of the overfit. I thought it was somehow “educational”. In general, communicating the results is my favourite part of the whole analysis, even more so for this playground competition.
Leaf Classification Playground Competition: Winning Kernels. Read more about the competition that challenged over 1,500 Kagglers to accurately identify 99 different species of plants based on a dataset of leaf images. In a series of mini-interviews, authors of top kernels from that competition share everything from why you shouldn't always jump straight to XGBoost to how to visually interpret PCA and k-means.