# suRprise! - Classifying Kinder Eggs by Boosting

## Abstract

Carrying the Danish tradition of Juleforsøg to the realm of statistics, we use R to classify the figure content of Kinder Eggs using boosted classification trees for the egg’s weight and possible rattling noises.

This work is licensed under a Creative Commons
Attribution-ShareAlike 4.0 International License. The
markdown+Rknitr source code of this blog is available under a GNU General Public
License (GPL v3) license from github.

## Introduction

A **juleforsøg** is the kind of exploding
experiment happening in the last physics or chemistry class before
the Christmas vacation. Not seldomly the teacher, with a look of
secrecy, initializes the class by locking the door mumbling something
like “the headmaster better not see this…”. With Christmas approaching
fast, here is an attempt to create a statistical juleforsøg concluding
the *Theory meets practice* 2016 posting season:

The advertisement campaign of the Kinder Surprise Eggs aka. Kinder Joy claims that the content of every 7th egg is a figure (see example) - otherwise they contain toys or puzzles, which positively can be described as junk. Figures, in particular completed series, on the other hand, can achieve high trading values. The clear goal is thus to optimize your egg hunting strategy in order to maximize figure content.

## The problem

Your budget is limited, so the question is which egg to select when standing in the supermarket?

Photo: Price in SEK per egg in a Swedish supermarket. The red ellipse shows the price per kg.

### Various egg selection strategies

It goes without saying that brute force purchasing strategies would be insane. Hence, a number of egg selection strategies can be observed in real life:

The no clue egg enthusiast: Selects an egg at random. With a certain probability (determined by the producer and the cleverness of the previous supermarked visitors) the egg contains a figure

The egg junkie: knows a good radiologist

The egg nerd: using scale, rattling noises and the barcode he/she quickly determines whether there is a figure in the egg

We shall in this post be interested in **the statistician’s egg
selection approach**: Egg classification based on weight and
rattling noise using ‘top-notch’ machine learning algorithms - in our
case based on boosted classification trees.

## Data Collection

We collected n=79 eggs of which 43.0% were figures - the data are
available under a GPL v3.0 license from github.
For each egg, we determined its **weight** as well as the
sound it produced when being shaken. If the sounds could be
characterized as **rattling** (aka. clattering) this was
indicative of the content consisting of many parts and, hence, unlikely
to be a figure.

Altogether, the first couple of rows of the dataset look as follows.

`head(surprise, n=5)`

```
## weight rattles_like_figure figure rattles rattles_fac figure_fac
## 1 32 1 0 0 no no
## 2 34 0 1 1 yes yes
## 3 34 1 1 0 no yes
## 4 30 1 0 0 no no
## 5 34 1 1 0 no yes
```

### Descriptive Data Analysis

The fraction of figures in the dataset was 34/79, which is way higher than the proclaimed 1/7; possibly, because professionals egg collectors were at work…

Of the 79 analysed eggs, 54 were categorized as non-rattling. The
probability of such a non-rattling egg really containing a figure was
51.9%. This proportion is not impressive, but could be due to the data
collector’s having a different understanding of exactly how the variable
*rattling* was to be interpreted: Does it *rattle*, or
does it *rattle like a figure*? In hindsight, a clearer
definition and communication of this variable would have prevented
ambiguity in the collection.

A descriptive plot of the weight distribution of eggs with and without figure content shows, that eggs with figures tend to be slightly heavier:

Note: The first approximately 50% of the eggs were weighted on a standard supermarket scales, which showed the resulting weight in even steps of 2g only.

Below the proportion (in %) of eggs with figure content per observed weight:

```
## weight
## figure_fac 26 28 29 30 31 32 33 34 35 36 40
## no 100.0 50.0 66.7 53.3 71.4 72.7 75.0 25.0 100.0 33.3 0.0
## yes 0.0 50.0 33.3 46.7 28.6 27.3 25.0 75.0 0.0 66.7 100.0
```

A simple selection rule based on weight would be to weigh eggs until you hit a 40g egg. A slightly less certain stopping rule would be to pick 34g eggs. However, modern statistics is more than counting and analysing proportions!

## Machine Learning the Egg Content

We use machine learning algorithms to solve the binary classification
problem at hand. In particular, we use the `caret`

package
(Kuhn 2016) and
classify figure content using boosted classification trees as
implemented in the `xgboost`

package (Chen et al.
2016). Details on how to use the `caret`

package can,
e.g., be found in the following tutorial.

```
library(caret)
##Grid with xgboost hyperparameters
= expand.grid(
xgb_hyperparam_grid nrounds = c(25, 50, 100, 250, 1000),
eta = c(0.01, 0.001, 0.0001),
max_depth = seq(2,16,by=2),
subsample = c(0.4,0.5,0.6),
gamma = 1, colsample_bytree = 0.8, min_child_weight = 1
)##caret training control object
<- trainControl(method="repeatedcv", number=8, repeats=8, classProbs=TRUE,
control summaryFunction = twoClassSummary, allowParallel=TRUE)
##train away and do it parallelized on 3 cores...
library(doMC)
registerDoMC(cores = 3)
<- train( figure_fac ~ weight * rattles_fac, data=surprise, method="xgbTree",
m_xgb trControl=control, verbose=FALSE, metric="ROC", tuneGrid=xgb_hyperparam_grid)
##look at the result
m_xgb
```

```
## eXtreme Gradient Boosting
##
## 79 samples
## 2 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (8 fold, repeated 8 times)
## Summary of sample sizes: 69, 70, 69, 69, 69, 68, ...
## Resampling results across tuning parameters:
##
## eta max_depth subsample nrounds ROC Sens Spec
## 1e-04 2 0.4 25 0.6661328 0.8661458 0.4328125
## 1e-04 2 0.4 50 0.6657943 0.8661458 0.4359375
## 1e-04 2 0.4 100 0.6760938 0.8661458 0.4398437
## ... ... ...
## 1e-02 16 0.6 250 0.6769792 0.7901042 0.4210937
## 1e-02 16 0.6 1000 0.6578516 0.7364583 0.4335938
##
## Tuning parameter 'gamma' was held constant at a value of 1
## Tuning
## parameter 'colsample_bytree' was held constant at a value of 0.8
## Tuning
## parameter 'min_child_weight' was held constant at a value of 1
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 250, max_depth = 6, eta = 0.01,
## gamma = 1, colsample_bytree = 0.8, min_child_weight = 1 and subsample = 0.4.
```

The average AUC for the 64 resamples is 0.69. Average sensitivity and specificity are 84.0% and 42.1%, respectively. This shows that predicting figure content with the available data is better than simply picking an egg at random, but no figure-guaranteeing strategy appears possible on a per-egg basis.

### Predicting the Content of a Particular Egg

Suppose the egg you look at weighs 36g and, when shaken, sounds like a lot of small parts being moved. In other words:

`predict(m_xgb, newdata = data.frame(weight=36, rattles_fac="yes"),type="prob")`

```
## no yes
## 1 0.4329863 0.5670137
```

Despite the rattling noises, the classifier thinks that it’s slightly more likely that the content is a figure. However, when we opened this particular egg:

…a car. Definitely not a figure! The proof of concept disappointment was, however, quickly counteracted by the surrounding chocolate…

As a standard operating procedure for your optimized future
supermarket hunt, below are shown the classifier’s predicted
probabilities for figure content as a function of egg weight and the
`rattles_fac`

variable.

## Discussion

The present post only discusses the optimal selection on a per-egg basis. One could weight & shake several eggs and then select the one with the highest predicted probability for containing a figure. Future research is needed to solve this sequential decision making problem in an optimal way.

### Outlook

We have retained a validation sample of 10 eggs and are willing to send an unconsumed 11th element of the sample to whoever obtains the best score on this validation sample. Anyone who knows how to upload this to kaggle?

*God jul*and a happy new year!

## Acknowledgments

Thanks to former colleagues at the Department of Statistics, University of Munich, as well as numerous statistics students in Munich and Stockholm, for contributing to the data collection. In particular we thank Alexander Jerak for his idea of optimizing figure hunting in a data driven way more than 10 years ago.

## Literature

*Xgboost: Extreme Gradient Boosting*. https://CRAN.R-project.org/package=xgboost.

*Caret: Classification and Regression Training*. https://CRAN.R-project.org/package=caret.