In August, over 350 new datasets were published on Kaggle, in part sparked by our $10,000 Datasets Publishing Award. This interview delves into the stories and backgrounds of August's three winners: Ugo Cupcic, Sudalai Rajkumar, and Colin Morris. They answer questions about what stirred them to create their winning datasets and share kernel ideas they'd love to see other Kagglers explore.
If you're inspired to publish your own datasets on Kaggle, know that the Dataset Publishing Award now recurs monthly and we'd love for you to participate. Check out this page for more details.
First Place, Grasping Dataset by Ugo Cupcic
Can you tell us a little about your background?
I’m the Chief Technical Architect at the Shadow Robot Company. I joined Shadow in 2009 after working briefly for a consulting firm in London. I joined Shadow as a software engineer when it was still a very small company, and my role evolved as the company grew and its needs diversified: senior software engineer, head of software, and finally CTA. My background is in bio-informatics (!) and AI. Feel free to connect on LinkedIn if you want to know more.
What motivated you to share this dataset with the community on Kaggle?
I had no idea there was a prize! At Shadow, we have a strong open-source culture. As you can see on GitHub, we share as much of our code as we can. We’re also active in the ROS community. It was therefore logical for us to share this dataset.
I was also personally very keen to step into the Kaggle community. For someone who is more of a roboticist than a pure machine learning person, Kaggle seems to be the de facto platform for sharing ML problems, datasets, etc. I’m always looking for fresh ideas from people working in related fields, often outside of robotics.
What have you learned from the data?
It’s a first foray into using machine learning to robustly predict grasp stability. The dataset can’t be used to deploy the trained algorithm on a real robot (yet). We’d have to invest more effort in a more robust grasping simulation before that could happen.
Determining whether a grasp will succeed or fail before you lift the object, or before the object falls, is a hot topic. In the video below, you can see the live grasp prediction working in the simulated sandbox. Grasp stability measurement is well studied in robotics, but it often assumes good knowledge of the object being grasped and of its relation to the robot. My goal is to see how far we can go without this.
If you want the full details behind the dataset you should take a look at my associated Medium story.
What questions would you love to see answered or explored in this dataset?
There’s the obvious question: how accurately can you predict whether a grasp will fail or succeed given this dataset? I’m definitely interested in learning more about your ideas on how to tackle it as a machine learning problem.
The other question I’m very interested in is: what sort of data would you like to have in order to build a better prediction algorithm?
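As a concrete starting point, here is a minimal sketch of framing the first question as binary classification with scikit-learn. The file name, the `robustness` column, and the stability threshold are assumptions made for illustration, not the dataset's documented schema.

```python
# Minimal sketch: grasp-stability prediction as binary classification.
# File name, the "robustness" column, and the threshold are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("shadow_robot_dataset.csv")  # hypothetical file name

# Hypothetical labelling: treat grasps above some robustness threshold as "stable".
y = (df["robustness"] > 100).astype(int)
X = df.drop(columns=["robustness"]).select_dtypes("number")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```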
So don’t hesitate to get in touch, either on Kaggle or on Twitter!
Second Place, Cryptocurrency Historical Prices by Sudalai Rajkumar
Can you tell us a little about your background?
I am Sudalai Rajkumar (aka SRK), working as a Lead Data Scientist with Freshworks. I graduated from PSG College of Technology and hold a professional certification in Business Analytics from IIM Bangalore. I have been in the data science field from the start of my career and have been doing Kaggle for almost five years now.
What motivated you to share this dataset with the community on Kaggle?
I was hearing a lot of buzz about cryptocurrencies and wanted to explore this field. I initially thought Bitcoin was the only cryptocurrency available. But when I started exploring it, I came to know that there are hundreds of cryptocurrencies available.
So, I was looking for datasets to understand more about them (at least the top 15-20 cryptocurrencies). I was able to find datasets for Bitcoin and Ethereum, but not for the others. I figured that if this dataset was useful for me, it would be useful for others too, and so I shared it. Thanks to Coin Market Cap, I was able to get the historical data for all these different currencies.
My next thought was to understand the price drivers of these coins. There are so many features of a blockchain, like the number of transactions (which affects the waiting time for transactions to get confirmed), the difficulty level of mining a new coin (which incentivizes the miners), etc., which could have an effect on the prices of altcoins. I was able to get these details for Bitcoin (thanks to Blockchain Info) and Ethereum (thanks to EtherScan).
What have you learned from the data?
I learned a lot about cryptocurrencies through this exercise. I was completely new to this crypto world when I started this exploration. Now I have a much better idea of how they work, what the different top cryptocurrencies are, etc.
I also learned from this data about the high price volatility of these currencies, which makes them highly risky but at the same time highly rewarding if we choose the right one. Hoping to make some investments now with the prize money!
What questions would you love to see answered or explored in this dataset?
Some interesting explorations could be (a couple of these are sketched in code after the list):
At the individual coin level:
- Price volatility of the coins
- Seasonal trends in the coins, if any
- Predicting the future price. A good attempt based on NNs can be seen here.
- Effect of other parameters on the price (for Bitcoin and Ethereum)
At the overall level:
- How do the price changes of the coins compare with each other? One good kernel on this can be seen here.
- How has the market cap of individual coins changed over time?
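Here is a minimal sketch of two of the explorations above: per-coin volatility from daily returns and a comparison of price movements across coins. The file names and the "Date"/"Close" columns are assumptions about the dataset layout rather than its documented schema.

```python
# Minimal sketch: rolling volatility per coin and cross-coin return correlations.
# File names and column names ("Date", "Close") are assumptions.
import pandas as pd

coins = ["bitcoin", "ethereum", "ripple"]  # hypothetical subset of the dataset
closes = {}
for coin in coins:
    df = pd.read_csv(f"{coin}_price.csv", parse_dates=["Date"])
    closes[coin] = df.set_index("Date")["Close"].sort_index()

prices = pd.DataFrame(closes)          # align all coins on the same dates
returns = prices.pct_change().dropna() # daily returns

# Price volatility: rolling 30-day standard deviation of daily returns.
volatility = returns.rolling(window=30).std()
print(volatility.tail())

# How do the coins' price changes compare? Pairwise return correlations.
print(returns.corr())
```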
Third Place, Favicons by Colin Morris
Can you tell us a little about your background?
I studied computer science at the University of Toronto and did a master's in computational linguistics. After university, I worked for about 3 years at Apple on a team that did AI-related iOS features. Since leaving Apple, I've enjoyed flitting around working on weird personal projects (most of them involving data science in some way).
What motivated you to share this dataset with the community on Kaggle?
I see a lot of potential in it for experiments in unsupervised deep learning, particularly when working with limited hardware or time. There's a classic image dataset called MNIST which is the go-to if you're making a deep learning tutorial, or benchmarking some new technique. It consists of 70,000 images of handwritten digits, downscaled to 28x28 pixels. The images in the favicon dataset are also tiny, but the great thing about them is that they're naturally tiny. The 290,000 16x16 images in the dataset were designed to be viewed at that size.
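To make the MNIST analogy concrete, here is a minimal sketch of loading the icons into an MNIST-style array. The directory of individual 16x16 PNG files is an assumption about how the data might be unpacked locally, not the dataset's documented layout.

```python
# Minimal sketch: load 16x16 favicons into an MNIST-style array.
# The directory of PNG files is a hypothetical local layout.
from pathlib import Path
import numpy as np
from PIL import Image

icon_dir = Path("favicons-16x16")  # hypothetical directory of PNG files
images = []
for path in sorted(icon_dir.glob("*.png"))[:10000]:
    img = Image.open(path).convert("RGBA").resize((16, 16))
    images.append(np.asarray(img, dtype=np.uint8))

icons = np.stack(images)  # shape (N, 16, 16, 4), analogous to MNIST's (N, 28, 28)
print(icons.shape)
```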
What have you learned from the data?
I've learned that, while most sites follow the convention of having square favicons that are somewhere between 16x16 and 64x64, there are plenty of weird exceptions. I published a kernel where you can see some of the most extreme examples. Aesthetically, my favourites are the ones that are smaller than 10x10. I think they belong in MoMA.
As a result of sharing the dataset around, I also learned that my idea wasn't as unique as I'd thought. Shortly after I published the favicon dataset, some researchers from ETH Zurich released a sneak peek of their own 'Large Logo Dataset', along with some really cool preliminary results from training a neural network to generate new icons. (When I shared a link to the dataset on Twitter, I semi-jokingly suggested that "someone should train a GAN on these". I was tickled when I got a reply from one of the researchers saying "We did!").
What questions would you love to see answered or explored in this dataset?
I think the most interesting application of the data is in training generative machine learning models - i.e. teaching models to draw new favicons after seeing many examples. Generative models have been a hot area in machine learning recently, with some new architectures coming out that have shown very impressive results. But most of the work has been applied to natural, photographic images (recent work with Google's QuickDraw doodle dataset is a very cool exception). I'm very interested in how well these generative models deal with non-photographic images. Favicons have much less detail than photographs, but on the other hand, there's the difficulty of dealing with a wide variety of art styles and subjects.
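As a hedged starting point for that kind of experiment (not the GAN mentioned above, just a small convolutional autoencoder), here is a minimal sketch. It assumes an `icons` array shaped (N, 16, 16, 4) as produced by the hypothetical loading sketch earlier.

```python
# Minimal sketch: a small convolutional autoencoder on 16x16 RGBA icons.
# Assumes the hypothetical `icons` array from the loading sketch above.
from tensorflow import keras
from tensorflow.keras import layers, models

x = icons.astype("float32") / 255.0  # scale pixel values to [0, 1]

encoder = models.Sequential([
    keras.Input(shape=(16, 16, 4)),
    layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"),
    layers.Flatten(),
    layers.Dense(32),  # small latent code
])
decoder = models.Sequential([
    keras.Input(shape=(32,)),
    layers.Dense(4 * 4 * 64, activation="relu"),
    layers.Reshape((4, 4, 64)),
    layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2DTranspose(4, 3, strides=2, padding="same", activation="sigmoid"),
])
autoencoder = models.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x, x, epochs=10, batch_size=128)

# Sampling random latent codes through the decoder gives crude "new" icons.
```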
I think it's also ripe for cool visualizations. Someone on Twitter made a beautiful mosaic of the first 1,000 images in the dataset, arranged by colour. I'd love to see more stuff like that, perhaps using some different clustering methods.
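Here is a minimal sketch of the colour-sorted mosaic idea: order icons by average hue and tile them into a single image. It again assumes the hypothetical `icons` array from the loading sketch.

```python
# Minimal sketch: tile icons into a mosaic, sorted by average hue.
# Assumes the hypothetical `icons` array (N, 16, 16, 4) from the loading sketch.
import colorsys
import numpy as np
from PIL import Image

def mean_hue(icon):
    # Average RGB, scaled to [0, 1], converted to a hue value.
    r, g, b = icon[..., :3].reshape(-1, 3).mean(axis=0) / 255.0
    return colorsys.rgb_to_hsv(r, g, b)[0]

subset = icons[:1024]  # 32x32 grid of icons
order = np.argsort([mean_hue(icon) for icon in subset])

mosaic = Image.new("RGBA", (32 * 16, 32 * 16))
for i, idx in enumerate(order):
    tile = Image.fromarray(subset[idx])
    mosaic.paste(tile, ((i % 32) * 16, (i // 32) * 16))
mosaic.save("favicon_mosaic.png")
```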