As we move into 2018, the monthly Datasets Publishing Awards have concluded. We're pleased to have recognized many publishers of high-quality, original, and impactful datasets. It was only a little over a year ago that we opened up our public Datasets platform to data enthusiasts all over the world to share their work. We've now reached almost 10,000 public datasets, which made choosing winners each month a difficult task! These interviews feature the stories and backgrounds of the November and December winners of the prize. This month, we're pleased to highlight:
- EEG data and code shared to combat the reproducibility crisis in neuroscience
- A handcrafted Russian version of the famous MNIST dataset for computer vision
- In-depth economic data about Darknet cocaine marketplaces
- Breast histopathology images shared to combine open-access biomedical data with crowdsourced analytics
- A wealth of weather data used to explore correlation patterns and demonstrate signal processing techniques
- Data and code uncovering the surprising everyday items sold on the dark web
While the Datasets Publishing Awards are over, you can still win prizes for code contributions to Kaggle Datasets. We're awarding $500 in weekly prizes to authors of high-quality kernels on datasets. Click here to learn more »
November Winners:
First Place, EEG data from Basic Sensory Task in Schizophrenia by Brian Roach
Can you tell us a little about your background?
I am currently working as a programmer analyst in a brain imaging and electroencephalography (EEG) lab focused on schizophrenia. It is an academic research lab run by three professors in the department of psychiatry at UCSF. Prior to moving out to San Francisco, I worked at Yale University. I have a master's in statistics from Texas A&M University. Before that, I studied cognitive science at Vassar College, where I had my first exposure to EEG and computer programming.
What motivated you to share this dataset with the community on Kaggle?
I was motivated to share this dataset for several reasons. The lab recently received some funding to work on single trial EEG classification in patients with schizophrenia and comparison control subjects. In particular, we run a set of experiments like the one used in the dataset I uploaded, where participants control the stimulus presentation (e.g., press a button to generate a sound) in one condition or passively observe the stimuli (e.g., listen to a series of sounds based on their previously generated sequence) in another condition. Humans and many other animals are able to suppress the response to self-generated stimuli. We have observed that people with schizophrenia, relative to comparison control subjects, do not show as strong a pattern of suppression in the averaged EEG brain response, called the Event-Related Potential (ERP). While we see this in the averaged response, classification of single trials might allow us to see what features in the EEG best differentiate between these conditions. I thought sharing this dataset on Kaggle might be a way to get feedback from the community on different approaches to this binary classification problem.
The other big reason was that after attending neurohackweek at the University of Washington this fall, I came back to the lab with concrete examples of combating the neuroscience reproducibility crisis in mind. Sharing both data and code to increase transparency should improve the research process and aid peer review. Publishing this dataset on Kaggle was a straightforward way to make both data and code available on one easily accessible platform.
What have you learned from the data?
To verify that my Python import worked, one of the first things I tried was applying the common spatial patterns (CSP) function to some of the data. It is not clear that the spatial topography is as consistent across subjects as it was in the EEG grasping data. I was also able to reproduce some, but not all, of the ERP effects previously published in a paper using R in this notebook.
What questions would you love to see answered or explored in this dataset?
As I mentioned above, single trial classification, particularly binary classification of the button press + tone vs the passive tone playback, might be used to address questions like: (1) Can we predict trial type with equivalent accuracy in both patients and controls? (2) Do the features in the EEG that best predict trial type vary between patients and controls? (3) Within the patient group, are there different sub-groups with similar feature patterns that differentiate the two trial conditions? For example, maybe some patients have more motor signal abnormalities, and others have more abnormal auditory sensory responses. Identifying these types of differences might allow future research studies to focus on patient-specific interventions (e.g., targeting motor vs auditory processing).
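Question (1) above amounts to scoring a classifier separately for each participant group. The following is a minimal sketch of that comparison, not the lab's analysis code. The function name `accuracy_by_group`, the group labels, and the trial labels are all assumptions made for illustration, and the records are made up rather than taken from the dataset.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Compute classification accuracy separately for each group.

    records: iterable of (group, true_label, predicted_label) tuples.
    Returns a dict mapping each group to its fraction of correct predictions.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}

# Made-up predictions for illustration only (not from the dataset).
# "press" = button press + tone, "playback" = passive tone playback.
example = [
    ("patient", "press", "press"),
    ("patient", "playback", "press"),
    ("control", "press", "press"),
    ("control", "playback", "playback"),
]
print(accuracy_by_group(example))
```

A large gap between the two per-group accuracies would suggest that the trial types are not equally separable in patients and controls.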
Second Place, Classification of Handwritten Letters, Images of Russian Letters by Olga Belitskaya
Can you tell us a little about your background?
After being a housewife for a long time, I'm returning to the workforce. My higher education, completed 15-22 years ago, was in economics and in the teaching of mathematics, physics and computer science. Over the past year, I have completed two interesting courses in modern programming (Data Analyst and Machine Learning Engineer). Now I'm going to find a job and apply my knowledge.
What motivated you to share this dataset with the community on Kaggle?
Two very well-known datasets (handwritten figures and letters of the English alphabet) are widely used to teach programming skills. It was interesting for me to create a similar set of Russian letters and assess how much more difficult it is for processing and classifying.
What have you learned from the data?
For me, it was surprising how colors and backgrounds influence the recognition of the main object by algorithms. It seems to me it will not be so easy to improve the accuracy of classifying this data. I have already learned a lot about this and will continue to discover problems.
What questions would you love to see answered or explored in this dataset?
Using this database, we can explore a very wide range of questions in image recognition.
The advantages of this set are absolute realism (the letters are simply written by hand and photographed), a large range of colors, and several different backgrounds.
So, this data allows conducting research in many areas:
- find a way to improve the classification accuracy;
- determine how the background and color decrease recognition;
- discover how well images are generated by algorithms based on real ones.
This database (and questions about it) can be expanded in several directions:
- add images with more backgrounds,
- add a sufficient number of capital letters and assess the deterioration of forecasting,
- find another person to write the same letters and try to classify their personal handwriting.
Third Place, Darknet Market Cocaine Listings by David Skip Everling
Can you tell us a little about your background?
My name is David Everling (aka Skip)! I'm a jack-of-all-trades data scientist who loves big ideas and creative engineering.
I studied Information Systems at Carnegie Mellon University in Pittsburgh, PA. I now live in the SF Bay Area (about 10 years), and I have been fortunate to work with prestigious tech companies like Google, Palantir, and Segment. I also spent two years as a neuroimaging researcher at Stanford University. I love to collaborate with smart, data-driven teams.
Currently I'm looking for opportunities to join a team of data scientists in San Francisco on a full-time basis. More about me on LinkedIn.
What motivated you to share this dataset with the community on Kaggle?
Megan from Kaggle saw a tweet from David Robinson about my project, and she suggested that I upload the dataset to Kaggle to share my work. I thought it was a good idea and agreed! I had no idea that it would qualify for a prize.
What have you learned from the data?
This was a fascinating dataset! I chose to scrape cocaine listings because that drug is easily quantifiable and can be compared across offerings.
The data makes plain how drugs are both wholesale and retail goods in digital marketplaces. They have economic patterns and competition just like traditional Internet retailers on Amazon. You can shop for deals on cocaine just like you shop for deals on a new mattress.
Cocaine sales follow particular geographic patterns that depend on factors like shipping connections and border control at the countries of origin and destination. Cocaine costs the most to order to Australia by a wide margin. The region selling the most cocaine internationally on this market seems to be northern central Europe centered around the Netherlands.
Because real-world identity is anonymized, trust is always a concern between parties on the dark web. As such, vendor ratings (not just product ratings) are among the most important features of a listing. If you are not a trusted vendor with corroborated transactions, few will risk buying from you even if you undercut prices. Therefore vendors have to curate their dark web identities for trust and reliability. New vendors might have to list "freebies" to attract buyers.
As a market average not controlling for local factors and sales, 100% pure cocaine costs a bit under $100 USD per gram.
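A per-gram figure like this requires putting listings of different sizes and purities on a common scale. One way to do that is sketched below. This is an assumed normalization, not necessarily the one used in the project, and the function name `price_per_pure_gram` and the example numbers are hypothetical.

```python
def price_per_pure_gram(price_usd, grams, purity_pct):
    """Normalize a listing price to USD per gram of 100% pure product.

    price_usd:  total listing price in US dollars
    grams:      listed weight in grams
    purity_pct: advertised purity as a percentage, in the range (0, 100]
    """
    if grams <= 0:
        raise ValueError("grams must be positive")
    if not 0 < purity_pct <= 100:
        raise ValueError("purity_pct must be in (0, 100]")
    pure_grams = grams * purity_pct / 100.0
    return price_usd / pure_grams

# Made-up listing for illustration: 5 g at 80% purity for $360
# contains 4 g of pure product.
print(price_per_pure_gram(360, 5, 80))
```

Averaging this value across listings gives a single market figure, though it inherits any inaccuracy in the advertised purity.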
You can read more about the data insights in my post on Medium.
What questions would you love to see answered or explored in this dataset?
It would be very interesting to see a more thorough exploration of vendor pricing schemes. For example: Do cocaine vendors use the same kind of bulk discounts and promotional sales as "clear web" retailers? How do new sellers attract buyers?
I collected vendor ratings and number of successful transactions, but haven't had time to explore those. How does a vendor's rating affect their prices? Does whether a vendor offers escrow affect their listings?
What other patterns are present in the product's text string? In the dataset I have already extracted price and quality, but there are other potentially meaningful signifiers present. For example, the words "uncut", "sample", or "Colombian" may each have an impact on the listing. These could become new features.
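As a rough sketch of how such words could become features, the snippet below turns a listing title into 0/1 flags. The keyword list comes from the examples just mentioned. The function name `keyword_features`, the `has_` prefix, and the sample titles are hypothetical and are not part of the dataset or the original code.

```python
import re

# Keywords taken from the examples in the text; extend as needed.
KEYWORDS = ("uncut", "sample", "colombian")

def keyword_features(title, keywords=KEYWORDS):
    """Return a dict of 0/1 flags, one per keyword.

    Matching is case-insensitive and on whole alphabetic words,
    so "samples" would not match "sample".
    """
    words = set(re.findall(r"[a-z]+", title.lower()))
    return {f"has_{kw}": int(kw in words) for kw in keywords}

# Made-up titles for illustration only (not rows from the dataset).
for title in ["1g Uncut Colombian Cocaine", "FREE SAMPLE 0.1g"]:
    print(title, "->", keyword_features(title))
```

Whole-word matching is a deliberate simplification here. Stemming or substring matching would catch variants such as "samples" at the cost of more false positives.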
Which countries are the biggest cocaine exporters in this market? How are real-world cocaine markets *not* reflected in this dataset?
Can we visualize the market from this dataset?
Feel free to adapt any or all of the code I wrote to process the data. You can find it here on GitHub!
December Winners:
First Place, Breast Histopathology Images by Paul Mooney
Can you tell us a little about your background?
My graduate research demanded that I quantitatively analyze large datasets of digital images that were acquired using fluorescence microscopy. In order to facilitate the statistical analysis of these large datasets, I frequently worked with scripting languages such as MATLAB and ImageJ Macro, and I took courses and pursued independent projects using both Python and Octave. Currently, I am inspired by the use of Python for applications such as Predictive Analytics, Machine Learning, and Data Science, and I have found that the Kaggle platform provides an excellent arena for my continued education.
What motivated you to share this dataset with the community on Kaggle?
I am interested in biomedical data, and I like to use the Kaggle platform to experiment with open-access biomedical datasets. The NIH does fantastic work to support and maintain numerous open-access data repositories (https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html), and crowd-sourced data analysis platforms are a promising tool that can be used to extract new insights and make new discoveries from this important data.
What have you learned from the data?
Convolutional networks can be used to identify diseased tissue and score disease progression. Advancements in deep learning algorithms are a promising new hope in the fight against cancer -- and the Kaggle Kernel is a great platform to test out new deep learning approaches (https://www.kaggle.com/paultimothymooney/predict-idc-in-breast-cancer-part-two).
What questions would you love to see answered or explored in this dataset?
Breast cancer is the most common form of cancer in women, and invasive ductal carcinoma (IDC) is the most common form of breast cancer. Accurately identifying and categorizing breast cancer subtypes is an important clinical task, and automated methods can be used to save time and reduce error. In the future it will be interesting to see how deep learning approaches can be used to improve this diagnostic task as well as improve other diagnostic tests in other clinical settings. The Kaggle platform is a powerful tool for developing computational methods in modern medicine, and open-access datasets just add fuel to the flame of new discovery.
Second Place, Historical Hourly Weather Data, 2012 to 2017 by SelfishGene
Can you tell us a little about your background?
What motivated you to share this dataset with the community on Kaggle?
What have you learned from the data?
What questions would you love to see answered or explored in this dataset?
Third Place, Darknet Marketplace Data by Philip James
Can you tell us a little about your background?
Right now I’m a junior at Fordham University majoring in Computer Science and minoring in Mathematics. I’ve actually only been a CS major for about 6 months, but I’ve found it to be something that I naturally excel in, care deeply about, and love expanding my knowledge upon.
Most recently I’ve been doing some self-learning on machine learning and statistical analysis to satisfy my personal curiosities and goals, but I’ve also been doing some really cool research over at Fordham! At the moment I’m working on two separate projects concurrently, one dealing with computer vision, and the other with wireless sensor efficiency and placement. You can find more details here on my LinkedIn!
What motivated you to share this dataset with the community on Kaggle?
It was just a “happy accident,” as Bob Ross would say. I was scouring the web to find some datasets and/or machine learning competitions when I happened to stumble upon Kaggle. After exploring the really fantastic datasets people had contributed, I realized I had just finished up a dataset of my own that could be really fun to mess around with, so I decided to share it!
What have you learned from the data?
Most prominently, I learned the extent of the trade of goods and services on the dark web. It’s astonishing to see the sheer volume and diversity of things being sold that aren’t available through legal channels. Perhaps one of the most interesting things I found was everyday items, such as magazine subscriptions, being sold on the same marketplace that contained highly illegal goods.
Brooks made some really fantastic visuals related to the dataset that I definitely recommend checking out here. They really help visualize the data wonderfully.
What questions would you love to see answered or explored in this dataset?
Honestly, there are so many I don’t know where to start. I think it would be really neat to see competition between vendors by comparing items in certain price categories, or perhaps even just trying to find if there are any correlations between price and vendor rating. Maybe certain regions sell more of a particular kind of item, or maybe some seller dominates some niche. The possibilities are quite extensive with a little bit of imagination!
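The price versus vendor rating idea could start with a plain Pearson correlation coefficient. Here is a minimal sketch using only the standard library. The function name `pearson` and the numbers are made up for illustration and do not come from the dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

# Made-up vendor ratings and item prices in USD, for illustration only.
ratings = [4.2, 4.5, 4.8, 4.9, 5.0]
prices = [55.0, 60.0, 72.0, 70.0, 85.0]
print(round(pearson(ratings, prices), 3))
```

A value near +1 or -1 would point to a strong linear relationship, while a value near 0 would suggest that price and rating are not linearly related in that item category.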