#### Goal

This is the second of a series of diary entries (the first one is here) " in which I gradually aim at explaining some aspects of the paper Minimization and estimation of the variance of prediction errors for cross-validation designs with Norbert Krautenbacher. It is freely available here.Well, we reviewed the situation of the t-test and showed how find confidence intervals for the mean of a population and for the square of the mean. Here, we want to apply that knowledge to the situation where each observation is the result of a resampling procedure such as random cross-validation where one randomly selects a fixed size subset, say of size $g$, applies the learning procedure to that random subset, and uses the remaining observations for testing.

The most important theorem to apply in this situation is the law of total variance that allows to treat the variance due to random subsampling differently from that attributed to the randomness in the data themselves.

So, let us enjoy the moment of contemplation where we just write out the law of total variance in its entire beauty: $$ \begin{split} \mathbb{V} \bigl(U(\mathcal{S}) \bigr) &= \mathbb{E} \bigl( \mathbb{V} \bigl(U(\mathcal{S}) | (X, Y) \bigr)\bigr) + \mathbb{V} \bigl( \mathbb{E} \bigl(U(\mathcal{S}) | (X, Y) \bigr)\bigr)\\ & = \mathbb{E} \bigl( \mathbb{V} \bigl(U(\mathcal{S}) | (X, Y) \bigr)\bigr) + \mathbb{V} ( \star ) \end{split} $$ where:

- $\mathcal{S}$ is a random collection of learning subsets and $U(\mathcal{S})$ is shorthand for the average error on these learning sets, testes for each learning set on the remaining observations. This is an incomplete $U$-statistic, hence the notation.
- $(X, Y)$ is the collection of all data. $X$ denotes the features, and $Y$ denotes the outcome or class label.
- $\star$ is the complete leave-$p$-out estimator of the error rate.

*conditional on the observed data*is exactly equal to the leave-$p$-out estimator. Indeed, this is because the expectation is taken with respect to random sampling which is just the discrete uniform distribution on the collection of all learning subsets of size $g$. The size of that collection is just the binomial coefficient $\binom{n}{g}$.

What is the first term? Well, we have to understand the inner term $ \mathbb{V} \bigl(U(\mathcal{S}) | (X, Y)\bigr)$. This is just the variance that can be estimated by repeating the subsampling procedure a few times. In each round, we draw the same number of learning sets, of the same set each. We average the error across the learning sets, and finally take the estimated variance across these rounds. Definitely doable! We must just not confuse what is actually going on: We draw say 10 learning sets of the same size each (for instance half of all observations). We average the 10 resulting errors out. Then we repeat drawing these 10 learning sets. Then we take the variance estimator across these repetition. Alright?

How do we estimate the other term, $\mathbb{V} ( \star )$? Well, this variance is $\mathbb{E}(\star^2) - \bigl(\mathbb{E}(\star) \bigr)^2$ where $\mathbb{E}\star$ is our $\theta$, the actual true unkown error rate. And this is where the first blog post comes in! Indeed, when we have $n$ values $x_1, \dots, x_n$, the best estimator for the square of the mean is not $$ \biggl(\frac{1}{n}\sum x_i\biggr)^2 $$ but rather $$ \biggl(\frac{1}{n}\sum x_i\biggr)^2 - \frac{1}{n(n-1)}\biggl(\sum (x_i - \bar{x}) \biggr) $$ Now, we just have to find an unbiased estimator of $\bigl(\mathbb{E}(\star) \bigr)^2$ but that's not so hard: we just have to average out products of two errors estimated on non-overlapping sets. And these are all the ingredients needed for the correct error estimator on random learning subsets!