Some thoughts on complete cross-validation in supervised machine learning and its approximation by random learning subsets, part 1: how to estimate the mean and its square

Goal

This is the first of a series of diary entries in which I gradually aim at explaining some aspects of the paper Minimization and estimation of the variance of prediction errors for cross-validation designs with Norbert Krautenbacher. It is freely available here.
So, what is all that about?

Reviewing the simplest situation: estimation of a population mean

Let us start with the easiest and simplest of all resampling situations. It is still worthwile and interesting to review a few simple formulas, for reference, for entertainment, and our edification: the situation where one draws independently numbers from a population and is just interested in their mean or average. Not too complicated, huh?
Ok, so assume we draw $n$ numbers from the population which may or not have less or more than $n$ members or be infinite. All that matters is that we draw independently, both in the intuitive and the mathematically rigorous sense.
The sample is denoted by $x_1, \dots, x_n$ as usual. Quite often one hears the wrong notion where each single $x_i$ is called a sample. Actually, all the $x_i$ together form the sample, and each $x_i$ is an observation. Accordingly, the number $n$ is the sample size.
The sample mean is then, of course, given by $$ \hat{m} = n^{-1}\sum x_i. $$ Just like the $x_i$, $\hat{m}$ is a random variable. Let $m$ denote the true sample mean. To be precise, it is defined by the Lebesgue integral $$ m = \int_{-\infty}^\infty x\ dP(x) $$ where $P$ is now to be thought of as a measure but well ...
Now, the following are true:

The expected value of $\hat{m}$ is $m$.
The variance of $\hat{m}$ is $n^{-1}\mathbb{V}(X)$ where $X$ is a "generic" random variable whose distribution is that of the $x_i$.
The best (in some sense) estimator of that variance $\mathbb{V}$ of $X$ is the sample variance $(n-1)^{-1}\sum (x_i - \hat{m})^2$.
Therefore, simple plugging in teaches us that the best estimator of the variance of our sample mean $\hat{m}$ is $(n(n-1))^{-1}\sum (x_i - \hat{m})^2$.
The estimated standard deviation of the sample mean is the square root of that. It's not universally unbiased because there exists no universally unbiased estimator but that's not such a big problem for us.
Accordingly, a good confidence interval for unkown parameter $m$ at a reasonable level $\alpha$ is "mean plus/minus magic number times standard deviation" $$ \biggl[ \hat{m} - t\sqrt{\frac{\sum (x_i - \hat{m})^2}{n(n-1)}}, \hat{m} + t\sqrt{\frac{\sum (x_i - \hat{m})^2}{n(n-1)}} \biggr]. $$

So, to keep simple things simple, let me throw a dice ten times, say $2-2-4-6-2-6-4-5-4-6$. We get the following values:

$\hat{m}$ = 4.1
$m$ is the true mean. Here, we know that it is $7/2 = 3.5$ because we just know what a dice is.
What is the variance of $X$? Well, the expectation of the square is $(1^2 + \dots + 6^2)/6 = 91/6$, and the square of the expectation is $(7/2)^2 = 49/4$. So, $$ \mathbb{V}(X) = 91/6 - 49/4 = 2.91666\dots $$
The sample variance is 2.76666pretty close to the true value 2.91666.
The confidence interval for the mean at the level we use above is $$ \biggl[2.910125, 5.289875\biggr] $$ which does indeed include the true value 3.5.

But what if, for whatever reason, we want to estimate $m^2$ instead of $m$?

There might be a little temptation to say that if $\hat{m}$ estimates $m$, then its square estimates $m^2$. However, that optimator turns out to be wrong - a "false friend" - in the sense that it is biased. Instead, the best or unbiased estimator is something else!
Since $\mathbb{V}X = \mathbb{E}X^2 - (\mathbb{E}X)^2 = \mathbb{E}X^2 - m^2$, we could try to estimate $m^2$ by the average of the squared observations (which is not the squared average, i.e., the false friend) minus the sample variance. Indeed, that is the best estimator, but let us try to understand more closely what is going on.
In the dice example, $m^2$ is $3.5^2 = 12.25$. The "wrong estimator" on the grounds of the sample above would be $4.1^2 = 16.81$, and the "good one" turns out to be $$G=\frac{1}{\binom{n}{2}}\sum_{i < j}x_ix_j$$ - it has no non-mixed terms unlike the false friend! This can also be written as $$\frac{1}{n(n-1)} \sum_{i \ne j}x_ix_j.$$ which evaluates to 16.5333slightly closer to the truth!
What's going on here is that the expectations of the monomial non-mixed terms such as $x_1^2$ and so on are variance of $X$ plus the right term, so that this variance is "in the way".
Ok, what if we now want to have a confidence interval for the estimation of our $m^2$ with our $G$? Well, let us pursue the same recipe as above: Confidence interval is best estimator plus/minus magic number times square root of variance estimator by $n$. So, we need to estimate the variance of the "good" estimator. For that, we need a formula for it. We could use the "good friend is average squared minus sample variance" to derive a formula for its variance. But that would be a little clumsy since we would have to work out the covariance between both terms.
There is some amount of theory on variance estimation of $U$-statistics but for now let us follow a somewhat intuitive road. Let us call the good estimator $G$. Its variance is the expectation of its square minus the square of its expectation, again: $$\mathbb{V}G = \mathbb{E}G^2 - m^4.$$ We estimate the former by $G^2$ (which is unbiased), and the latter byhold on a sec, if we can estimate something, how do we estimate its square?oh wait, isn't that why we're looking at $G$ in the first place? Funnily, we have run into a recursive problem here. For the latter part of the variance of $G$, we need to be able to estimate the square of its expectation, which should not be done by the false friend estimator but by "its own good" friend. Now, what is its own good friend for the square $(m^2)^2 = m^4$ of the expectation $m^2$ of $G$?. Instead, looking at the nice formula for $G$ we can already guess that $$H = \frac{1}{\binom{n}{4}}\sum_{i < j < k < l}x_ix_jx_kx_l$$ is the right estimator for the square of the expectation of $G$. (Actually, $H$ is the $U$-statistic for the parameter $m^4$.) So, to put both terms together, a good estimator for the variance of $G$ is $$K = G^2 - H.$$ Indeed, $G^2$ is a good estimator for its expectation $\mathbb{E}G^2$, and $H$ is the good friend estimator for $m^4$. The confidence interval for $m^2$ that we seek is $$\biggl[G - t\sqrt{K}, G + t\sqrt{K}\biggr]$$.
Note that we don't divide by $n$ within the square root anymore. This is because that division by $n$ is already implicitly happening in both terms $G^2$ and $H$ in the binomial coefficients. The magic number $t$ is given as above, mildly depending on $n$. Let us plug in the example above. Recall that we know the true value $m^2 = 12.25$. Let's see if our confidence interval contains that number! We obtain

$G = 16.5333$
$H=254.6286$
The confidence interval is $\left[6.7451, 26.3216\right]$.

That does indeed contain the correct value $m^2=12.25$. Why is the confidence interval so big? Well, $n=10$ was a really small sample. What happens on $n=200$? We expect the confidence intervals width to scale with the inverse square root of $n$, so it should be about $\sqrt{20}$ i.e. $4.5$ times smaller. I threw a dice $n=200$ times (virtually, of course) and obtained a confidence interval $\left[9.37, 12.51\right]$. Everything worked out fine!
Ok, now for the adults: why do we do all that? Isn't that just textbook knowledge? On the one hand, it's always good to review the basics. On the other hand, the estimators $G, H$ and $K$ aren't even that standard, I'd argue. And finally, when we apply all that to a slightly different scenario namely to resampling in supervised learning we obtain some really important and interesting results! See the following diary entries for that.