Bosch's competition, which ran from August to November 2016, challenged Kagglers to predict rare manufacturing failures in order to improve production line performance. While the challenge was ongoing, participants had the opportunity to submit research papers based on the competition to the Symposium for Advanced Manufacturing at the 2016 IEEE International Conference on Big Data.
Based on peer review by experts in the field, three teams were chosen to receive $2,000 travel grants to present their work at the symposium in December in Washington, D.C. In this blog post, we congratulate Ankita Mangal and Nishant Kumar, Abhinav Maurya, and Bohdan Pavlyshenko on their awards, and they share their approaches to the competition along with the research they presented at the symposium.
Ankita Mangal & Nishant Kumar
What was your background prior to entering this challenge?
Nishant: I am currently a Data Scientist at Uber and also a Kaggle Master. I am a Mechanical Engineer turned Data Scientist, and I have enjoyed working in the field of Machine Learning. I started with Kaggle 4 years ago, and it has helped me a lot in improving my Machine Learning skills. There are a lot of techniques from Kaggle competitions which I use in my professional life at Uber. Uber is a great place to work, where we solve challenging problems, much like in a Kaggle competition. We are currently hiring passionate Data Scientists/Engineers. My LinkedIn profile can be found here.
Ankita: I am a doctoral candidate in Materials Science & Engineering at Carnegie Mellon.
I am particularly interested in bringing together the fields of data science and ICME (Integrated Computational Materials Engineering). I identify useful concepts from machine learning, network analysis, and structure mining and apply them to solving materials science problems. I got interested in this challenge as it gave me a platform to apply skills from my current forte and my prior experience in quality assurance management at Tata Steel Ltd. My LinkedIn profile can be found here.
Can you describe your approach in the Bosch competition?
The first thing we noticed was the dataset size (~14.3 GB) and the large number of features (4265), of categorical (2140), numerical (968), and timestamp (1157) kinds. The categorical features consisted of both single- and multi-class labels, so utilizing them via one-hot encoding presented its own problems because of the increased dimensionality of the feature space. Hence, we decided to use an online learning model with feature hashing, implemented via the Follow the Regularized Leader (FTRL) algorithm.
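To make the idea of online learning with feature hashing concrete, here is a minimal sketch of a logistic regression trained with the per-coordinate FTRL-Proximal update. It is not the team's code: the hyperparameters (`alpha`, `beta`, `l1`, `l2`), the number of hash buckets, the use of `zlib.crc32` as the hash function, and the toy column name and values are all assumptions made for illustration.

```python
import math
import zlib


class FTRLProximal:
    """Online logistic regression on hashed categorical features (FTRL-Proximal)."""

    def __init__(self, num_buckets=2**20, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.num_buckets = num_buckets
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-bucket accumulated (adjusted) gradients
        self.n = {}  # per-bucket accumulated squared gradients

    def _buckets(self, row):
        # Hashing trick: every "column=value" pair maps to one of num_buckets
        # slots, so no one-hot dictionary has to be built or kept in memory.
        return {zlib.crc32(f"{col}={val}".encode()) % self.num_buckets
                for col, val in row.items()}

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # the L1 term keeps weak buckets at exactly zero
        shrunk = z - math.copysign(self.l1, z)
        scale = (self.beta + math.sqrt(self.n.get(i, 0.0))) / self.alpha + self.l2
        return -shrunk / scale

    def _probability(self, weights):
        score = max(min(sum(weights.values()), 35.0), -35.0)  # avoid overflow
        return 1.0 / (1.0 + math.exp(-score))

    def predict(self, row):
        return self._probability({i: self._weight(i) for i in self._buckets(row)})

    def update(self, row, label):
        """Predict on one example, then learn from its label (0 or 1)."""
        weights = {i: self._weight(i) for i in self._buckets(row)}
        p = self._probability(weights)
        g = p - label  # log-loss gradient for a feature whose value is 1
        for i, w in weights.items():
            n_old = self.n.get(i, 0.0)
            sigma = (math.sqrt(n_old + g * g) - math.sqrt(n_old)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * w
            self.n[i] = n_old + g * g
        return p


# Toy stream: the column name and its values are made up for illustration.
model = FTRLProximal()
for _ in range(200):
    model.update({"L0_S1_F25": "T1"}, 1)  # parts with value T1 fail
    model.update({"L0_S1_F25": "T2"}, 0)  # parts with value T2 pass
print(model.predict({"L0_S1_F25": "T1"}), model.predict({"L0_S1_F25": "T2"}))
```

Because each row is seen once and only the touched buckets are stored, memory use depends on the number of distinct hashed values rather than on the width of a one-hot matrix, which is the property that makes this family of models attractive for a dataset of this size.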
We divided the training dataset into two parts and trained an online learning model on each part using categorical features only. Next, we used the model trained on one fold to predict on the other fold, and constructed a probability column to be used as a feature in the next steps. This way, we captured the information from the 2140 categorical features in just one numerical feature.
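The two-fold scheme described above can be sketched as follows. To keep the example short, a smoothed per-category failure rate stands in for the online learning model, and the rows are split by alternating index; both choices are assumptions for illustration, not the team's setup.

```python
from collections import defaultdict


def fit_rate_model(rows, labels, prior_weight=1.0):
    """Stand-in learner: smoothed failure rate for each (column, value) pair."""
    base = sum(labels) / len(labels)
    counts = defaultdict(lambda: [0.0, 0.0])  # pair -> [failures, occurrences]
    for row, y in zip(rows, labels):
        for pair in row.items():
            counts[pair][0] += y
            counts[pair][1] += 1
    rates = {pair: (fails + prior_weight * base) / (total + prior_weight)
             for pair, (fails, total) in counts.items()}
    return base, rates


def predict_rate_model(model, row):
    base, rates = model
    if not row:
        return base
    # Unseen pairs fall back to the overall failure rate of the training fold.
    return sum(rates.get(pair, base) for pair in row.items()) / len(row)


def two_fold_oof(rows, labels, fit, predict):
    """Score every row with a model that never saw that row's label."""
    n = len(rows)
    fold_a = [i for i in range(n) if i % 2 == 0]
    fold_b = [i for i in range(n) if i % 2 == 1]
    oof = [None] * n
    for train_idx, score_idx in ((fold_a, fold_b), (fold_b, fold_a)):
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        for i in score_idx:
            oof[i] = predict(model, rows[i])
    return oof


# Toy data with a made-up categorical column "c".
rows = [{"c": value} for value in ["bad", "bad", "ok", "ok"] * 2]
labels = [1, 1, 0, 0] * 2
categorical_probability = two_fold_oof(rows, labels, fit_rate_model, predict_rate_model)
print(categorical_probability)
```

The resulting list has one value per training row and can be appended as a single numerical column. Predicting each fold with the model trained on the other fold matters because a model scoring its own training rows would leak the labels into the new feature.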
Then, we stacked this probability output from the categorical features with the remaining numerical and timestamp features, and used an Extreme Gradient Boosting (XGBoost) classifier to find the 200 most important features. A final XGBoost model was then trained on these 200 features to get the final predictions.
Tell us about the paper you presented at the Symposium for Advanced Manufacturing
The Symposium gave us a unique opportunity to share our approach with other Kagglers, as well as with a broader audience interested in the field of smart manufacturing, ranging from company representatives to full-time researchers. To cater to this audience, we explored the anonymized dataset to find insights about the assembly line and presented a picture of what’s happening on the shop floor. Every assembly line has certain production flows, and to gain insight into them, we used the information contained in the feature names. We found that the assembly line consists of 51 stations distributed among 4 production lines. Each station has a different number of parts passing through it, which could mean that different classes of products exist.
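As an illustration of how feature names can reveal the line layout, the sketch below assumes that columns are named like `L3_S32_F3850`, meaning line 3, station 32, feature 3850, with `D` instead of `F` for date columns. That naming pattern and the example column names are assumptions here, since the post does not spell them out.

```python
import re
from collections import defaultdict

# Assumed naming pattern: L<line>_S<station>_F<feature> or L<line>_S<station>_D<date>.
FEATURE_NAME = re.compile(r"^L(\d+)_S(\d+)_([FD])(\d+)$")


def stations_by_line(column_names):
    """Group the station numbers found in the column names by production line."""
    layout = defaultdict(set)
    for name in column_names:
        match = FEATURE_NAME.match(name)
        if match is None:
            continue  # skip columns such as "Id" or "Response"
        line, station = int(match.group(1)), int(match.group(2))
        layout[line].add(station)
    return {line: sorted(stations) for line, stations in sorted(layout.items())}


columns = ["Id", "L0_S0_F0", "L0_S0_D1", "L0_S1_F24", "L3_S32_F3850", "Response"]
print(stations_by_line(columns))  # {0: [0, 1], 3: [32]}
```

Counting the non-missing values per station in the same way would give the number of parts passing through each station.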
By comparing the number of measurements taken, the defective product rate, and the number of products passing through each station, we concluded that one of the stations (number 32) is probably a re-processing or post-processing station, because it has the highest error rate, very few products pass through it, and only one kind of measurement is taken there. (As illustrated in the figure below, the error rate/fraction for Station 32 is the highest.)
The timestamp features were also anonymized, and we calculated the autocorrelation between the number of products measured at each time unit as a function of time lag between them to understand the anonymized time units. We found out that the dataset consisted of measurements taken over 102.5 weeks and the measurements were recorded at a granularity of 6 minutes. Thus we can infer some structure about the timestamps from the anonymized features.
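The autocorrelation idea can be sketched on synthetic data. The counts below are made up (five busy time units followed by two idle ones), and picking the lag with the largest autocorrelation is a simplification of the analysis described above.

```python
def autocorrelation(counts, max_lag):
    """Sample autocorrelation of a count series for lags 0..max_lag."""
    n = len(counts)
    mean = sum(counts) / n
    centered = [c - mean for c in counts]
    variance = sum(c * c for c in centered)
    if variance == 0:
        raise ValueError("series is constant, autocorrelation is undefined")
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / variance
        for lag in range(max_lag + 1)
    ]


def strongest_lag(counts, max_lag):
    """Lag (>= 1) at which the series is most similar to a shifted copy of itself."""
    acf = autocorrelation(counts, max_lag)
    return max(range(1, max_lag + 1), key=lambda lag: acf[lag])


# Synthetic series: products measured per time unit, repeating every 7 units.
counts = [5, 5, 5, 5, 5, 0, 0] * 20
print(strongest_lag(counts, max_lag=10))  # 7
```

On real data, a peak at a given lag suggests how many anonymized time units make up a natural cycle such as a day or a week, which is what lets the units be mapped back to real time.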
Using the machine learning techniques described above, we built a model with an AUC of 0.716 and a Matthews correlation coefficient of 0.23. This means the model can be used to tag products that are likely to fail, giving a smarter failure detection process that is much better than checking for defects at random. With this model, out of the 1 million test samples, only 3000 are tagged as likely to fail, which saves time and resources through reduced product downgrading, increased salvage, and higher production yields.
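For readers unfamiliar with the Matthews correlation coefficient (MCC), here is a small sketch of how it is computed from the confusion matrix, together with a made-up example showing why it is preferred over accuracy when failures are rare. The part counts in the example are illustrative and are not taken from the Bosch data.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """MCC from binary labels and binary predictions (1 = failure, 0 = pass)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


# 1000 hypothetical parts, 6 of which fail; the "model" passes every part.
y_true = [1] * 6 + [0] * 994
y_pred = [0] * 1000
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, matthews_corrcoef(y_true, y_pred))  # 0.994 0.0
```

A model that never flags a failure looks excellent by accuracy but scores an MCC of zero, the same as random guessing, which is why an MCC of 0.23 on such imbalanced data is a meaningful improvement.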
The final model showed that the most important features influencing product failure are:
- categorical probability
- time spent by a product in the production line
- whether products belonged to the same batch (same timeline), and
- products passing through production line 3 (possibly because station 32 belongs to that).
The final model we submitted ranked in the top 10% on the private leaderboard and included leakage (magic) features. These features use the fact that sequential components with the same numerical/date readings belong to a batch and hence have a similar rate of failure. But this information is not available in real-time processes, so we did not use these features in the paper. As a result, the model described in the paper can be applied at Bosch to reduce failure rates. For more details, please refer to the paper here.
Abhinav Maurya
What was your background prior to entering this challenge?
I am a PhD student in Information Systems at Carnegie Mellon University. My primary research interests are machine learning, data science, Bayesian statistics, and deep learning. I like designing and developing machine learning methods that scale to massive datasets, are easily interpretable by users, and help bridge the prediction-decision gap in machine learning by providing actionable insights into the data. My past research projects include a diverse set of socially relevant problems tackled through the lens of Bayesian statistics. My LinkedIn profile can be found here.
Can you describe your approach in the Bosch competition?
In the Kaggle challenge, our goal was to detect if a manufactured part suffers from internal defects, based on sensor measurements from the assembly lines during the manufacturing process. There are two possible approaches to tackle this problem: Anomaly Detection and Binary Classification. We adopted the second approach in order to utilize anomaly supervision since the dataset contained datapoints that were specifically marked as anomalous.
Since internal defect rates in modern manufacturing processes are low due to excellent statistical quality control, the resulting dataset is often highly imbalanced, with very few anomalous, positive datapoints. In order to deal with the severe imbalance in the number of positive and negative datapoints, we chose to learn a weight parameter “w” that trades off between the losses incurred on the positive and negative datapoints. Specific to the Bosch challenge, our approach was to design a Gaussian Process-based meta-optimization algorithm that directly optimized the required metric of Matthews Correlation Coefficient (MCC) using Gradient Boosting Machine (GBM) as a base classifier. The following figure provides a schematic overview of our system:
Tell us about the paper you presented at the Symposium for Advanced Manufacturing
Predicting internal failures in manufactured products is a challenging machine learning task for two reasons: (i) the rarity of such failures in modern manufacturing processes, and (ii) the failure of traditional machine learning algorithms to optimize non-convex metrics such as the Matthews Correlation Coefficient (MCC) used to measure performance on imbalanced datasets. In our paper, we presented “ImbalancedBayesOpt”, a meta-optimization algorithm that directly maximized MCC by learning the optimal weights on the losses incurred on the positive and negative datapoints.
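The outer loop of such a meta-optimization can be sketched in a few lines. This is not “ImbalancedBayesOpt” itself: a plain grid search stands in for the Gaussian Process optimizer, a one-feature logistic regression stands in for the GBM, the data are synthetic, and MCC is measured on the training data only to keep the example short (a held-out set would be needed in practice).

```python
import math


def mcc(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denominator == 0 else (tp * tn - fp * fn) / denominator


def fit_weighted_logistic(xs, ys, pos_weight, steps=300, lr=0.5):
    """Gradient descent on log loss where each positive counts pos_weight times."""
    slope, intercept = 0.0, 0.0
    total_weight = sum(pos_weight if y == 1 else 1.0 for y in ys)
    for _ in range(steps):
        grad_slope = grad_intercept = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(slope * x + intercept)))
            weight = pos_weight if y == 1 else 1.0
            grad_slope += weight * (p - y) * x
            grad_intercept += weight * (p - y)
        slope -= lr * grad_slope / total_weight
        intercept -= lr * grad_intercept / total_weight
    return slope, intercept


def search_pos_weight(xs, ys, candidates):
    """Inner step: fit with weight w. Outer step: keep the w with the best MCC."""
    best = None
    for w in candidates:
        slope, intercept = fit_weighted_logistic(xs, ys, w)
        predictions = [1 if slope * x + intercept > 0 else 0 for x in xs]
        score = mcc(ys, predictions)
        if best is None or score > best[1]:
            best = (w, score)
    return best


# Synthetic imbalanced data: 40 passing parts and 4 failing parts.
xs = [0.05 * i for i in range(40)] + [1.5, 1.8, 2.1, 2.4]
ys = [0] * 40 + [1] * 4
best_weight, best_mcc = search_pos_weight(xs, ys, [1.0, 3.0, 10.0, 30.0])
print(best_weight, best_mcc)
```

The key point is that the inner training loss stays smooth and easy to optimize, while the outer search only needs MCC values for a handful of candidate weights. A Gaussian Process optimizer plays the same role as the grid here but chooses the next candidate weight based on the scores seen so far.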
We used Gradient Boosting Machine (GBM) as the base classifier for our meta-optimization algorithm due to its competitive performance on machine learning prediction tasks. Using “ImbalancedBayesOpt”, we could significantly improve the classification performance of the base classifier on the severely imbalanced high-dimensional Bosch dataset for detecting rare internal manufacturing defects. Our presentation from the IEEE BigData 2016 conference can be found here.
Bohdan Pavlyshenko
What was your background prior to entering this challenge?
I work as a Data Scientist at SoftServe (Ukraine), and I am an associate professor (Ph.D.) at the electronics and computer technologies faculty of Ivan Franko National University of Lviv (Ukraine). I hold the Master tier on Kaggle. Our team “The Slippery Appraisals” won the Grupo Bimbo Inventory Demand Kaggle competition. My current research areas are data mining, predictive analytics, supply chain analysis, machine learning, information retrieval, text mining, natural language processing, R analytics, social network analysis, big data, and the semantic field approach to the analysis of semi-structured data.
Can you describe your approach in the Bosch competition?
The main idea of our study in the Bosch Data Challenge was to show different approaches to applying logistic regression to the problem of detecting manufacturing failures. We considered machine learning, linear, and Bayesian models. The machine learning approach can give the best-scoring failure detection. The generalized linear model for logistic regression makes it possible to investigate the factors that influence failure detection within groups of manufactured parts. With a Bayesian model, it is possible to obtain the statistical distribution of the model parameters, which can be used in risk assessment analysis. With two-level models, we can obtain more precise results. Using a Bayesian model on the second level, with the probabilities predicted by the first-level machine learning models as covariates, makes it possible to take into account the differences between the results of machine learning models fitted on different sets of parameters and subsets of samples when the classes are highly imbalanced.
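To illustrate the second level of such a two-level model, the sketch below fits a Bayesian logistic regression whose covariates are the probabilities produced by two first-level models. Everything specific here is an assumption for illustration: the first-level probabilities are synthetic, the prior is a zero-mean Gaussian with standard deviation 5, and a simple random-walk Metropolis sampler is used, which may differ from the tools used in the paper.

```python
import math
import random


def log_posterior(beta, rows, labels, prior_sd=5.0):
    """Log of Gaussian prior times Bernoulli likelihood (up to a constant)."""
    total = -sum(b * b for b in beta) / (2.0 * prior_sd ** 2)
    for row, y in zip(rows, labels):
        z = sum(b * v for b, v in zip(beta, row))
        # Numerically stable form of y*z - log(1 + exp(z)).
        total += y * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total


def metropolis(rows, labels, n_samples=3000, step=0.3, seed=0):
    """Random-walk Metropolis; returns one coefficient vector per iteration."""
    rng = random.Random(seed)
    beta = [0.0] * len(rows[0])
    current = log_posterior(beta, rows, labels)
    samples = []
    for _ in range(n_samples):
        proposal = [b + rng.gauss(0.0, step) for b in beta]
        candidate = log_posterior(proposal, rows, labels)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            beta, current = proposal, candidate
        samples.append(list(beta))
    return samples


def summarize(samples, index, burn_in=1000):
    """Posterior mean and 5%/95% quantiles of one coefficient after burn-in."""
    draws = sorted(s[index] for s in samples[burn_in:])
    mean = sum(draws) / len(draws)
    return mean, draws[int(0.05 * len(draws))], draws[int(0.95 * len(draws))]


# Synthetic first-level outputs: each row is [1 (intercept), p_model_1, p_model_2].
data_rng = random.Random(1)
rows, labels = [], []
for _ in range(150):
    y = 1 if data_rng.random() < 0.2 else 0
    p1 = min(max(data_rng.gauss(0.7 if y else 0.3, 0.15), 0.01), 0.99)
    p2 = min(max(data_rng.gauss(0.6 if y else 0.4, 0.20), 0.01), 0.99)
    rows.append([1.0, p1, p2])
    labels.append(y)

samples = metropolis(rows, labels)
for name, index in (("intercept", 0), ("model 1", 1), ("model 2", 2)):
    mean, low, high = summarize(samples, index)
    print(f"{name}: mean={mean:.2f}, 90% interval=({low:.2f}, {high:.2f})")
```

Instead of a single coefficient per first-level model, the sampler yields a whole distribution, so the width of each interval indicates how much the second-level model trusts that first-level model. This is the kind of information that can feed into a risk assessment.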
In our work, we did not invest a lot of time in feature construction and selection because the features were anonymized, so we could not apply different models of feature interaction based on domain knowledge of the data. Achieving a high logistic regression score was therefore not our goal. As is known, during the competition so-called magic features based on the sample IDs were found, which improve the score substantially:
- URL: https://www.kaggle.com/mmueller/bosch-production-line-performance/road-2-0-4
- The Magic Feature: from LB 0.3 to 0.4+, URL: https://www.kaggle.com/c/bosch-production-line-performance/forums/t/24065/themagical-feature-from-lb-0-3-to-0-4
In our presentation, we made calculations for 2 sets of features: set 1 contained the most important features, and set 2 was set 1 with 4 magic features added.
Tell us about the paper you presented at the Symposium for Advanced Manufacturing
In the Bosch Production Line Performance Kaggle competition, Bosch invited participants to apply for one of three travel stipends to attend the conference and present research based on their work in this competition (https://www.kaggle.com/c/bosch-production-line-performance/details/ieee-bigdata-2016). I submitted the results of my research, and based on the review scores my paper was chosen for presentation at the symposium and I won a travel grant to attend it (https://www.kaggle.com/c/bosch-production-line-performance/forums/t/25032/symposium-winners).
This symposium was intended to provide a platform for researchers and industry practitioners from the manufacturing, information science, and data science disciplines to share their research results in data mining and big data analytics, as well as practical design and development experiences in the manufacturing industry.
At this symposium, the keynote speaker, Dr. Rumi Ghosh, gave a very interesting talk, “From Sensors to Sensing- Industrial Data Mining at Bosch”. She told the attendees about the Bosch Data Challenge on Kaggle and described the results obtained and the problems involved in analyzing data received from assembly lines during the manufacturing processes at Bosch.
At the symposium, I gave my talk “Machine Learning, Linear and Bayesian Models for Logistic Regression in Failure Detection Problems”. The main results from my talk are described in my article here. The conference and symposium were very interesting for me, with many interesting talks, presentations, and discussions. Special thanks to Bosch for organizing such an interesting Kaggle competition, Bosch Production Line Performance, and for awarding me the travel grant to attend the IEEE BigData 2016 conference!