Estimation and Bias-Variance

How sample-based estimators target population parameters, how bias and variance measure different kinds of estimation error, and why mean squared error balances them.
Modified: April 26, 2026

Keywords

estimator, bias, variance, mean squared error, sampling

1 Role

This page is the first real inference page in the statistics module.

Its job is to explain how we turn data into a rule for estimating an unknown population quantity, and how we judge whether that rule is good.

2 First-Pass Promise

Read this page after Descriptive Statistics and Data Models.

If you stop here, you should still understand:

  • what an estimator is and how it differs from an estimate
  • what bias and variance measure
  • why low variance alone is not enough
  • why mean squared error is the simplest way to balance bias and variance

3 Why It Matters

Statistics is not just about computing a sample mean and moving on.

The real question is: if you used the same estimation procedure on many fresh samples, how would it behave?

That matters immediately in practice:

  • a benchmark average can fluctuate a lot across random seeds
  • a heavily regularized model can become stable but systematically off-target
  • a summary statistic can look precise while still being biased by the data-collection process
  • an estimator can be unbiased but so noisy that it is not actually useful

Bias and variance give names to two different failure modes. Once you can separate them, statistical arguments become much clearer.

4 Prerequisite Recall

  • a parameter is a population quantity you want to learn about
  • a statistic is a quantity computed from a sample
  • expectation describes average behavior across repeated sampling
  • variance measures how much a random quantity fluctuates

5 Intuition

An estimator is a rule.

It takes in data and outputs a guess for an unknown parameter. If the data change, the output changes too, so the estimator itself is a random quantity before the sample is observed.

That immediately creates two natural questions:

  1. On average, does the estimator point at the right target?
  2. How much does the estimator jump around from sample to sample?

The first question is about bias. The second is about variance.

Those are not the same problem. You can make an estimator extremely stable by always reporting the same number, but then its bias may be awful. You can also make an estimator unbiased but very noisy. Good estimation is about balancing both.

6 Formal Core

Definition 1 (Estimator and Estimate) Let \(\theta\) be an unknown population parameter.

An estimator of \(\theta\) is a statistic \[ \hat{\theta} = T(X_1,\dots,X_n) \] computed from the sample.

After the data are observed, the realized numerical value of \(\hat{\theta}\) is called the estimate.
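The estimator/estimate distinction is easy to see in code: the estimator is a function, and the estimate is one realized return value. A minimal sketch (the sample values are made up for illustration):

```python
# The estimator is a rule: a function from samples to numbers.
def sample_mean(xs):
    return sum(xs) / len(xs)

# The estimate is the realized value on one observed sample.
observed = [0, 1, 1, 0, 1]        # hypothetical data
estimate = sample_mean(observed)  # 0.6
```

Before `observed` is fixed, `sample_mean` applied to a random sample is itself a random quantity; after observation, `estimate` is just a number.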

Definition 2 (Bias and Variance) The bias of an estimator \(\hat{\theta}\) for parameter \(\theta\) is \[ \operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta. \]

The estimator is unbiased if \[ \mathbb{E}[\hat{\theta}] = \theta. \]

Its variance is \[ \operatorname{Var}(\hat{\theta}), \] which measures how much the estimator changes across repeated samples.

Proposition 1 (Mean Squared Error Decomposition) For scalar parameter estimation, the mean squared error is \[ \operatorname{MSE}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta}-\theta)^2\big]. \]

It decomposes as \[ \operatorname{MSE}(\hat{\theta}) = \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^2. \]

This identity makes the bias-variance tradeoff precise: reducing one term can increase the other.
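The decomposition can be checked numerically. The sketch below uses an assumed true parameter, a deliberately biased estimator (the sample mean shrunk toward zero), and Monte Carlo estimates of each term; all the specific values are illustrative:

```python
import random

random.seed(0)
theta = 2.0           # assumed true parameter for the simulation
n, reps = 20, 100_000

# Deliberately biased estimator: shrink the sample mean toward 0.
draws = []
for _ in range(reps):
    xs = [random.gauss(theta, 1.0) for _ in range(n)]
    draws.append(0.8 * sum(xs) / n)

mean_hat = sum(draws) / reps
bias = mean_hat - theta
var = sum((d - mean_hat) ** 2 for d in draws) / reps
mse = sum((d - theta) ** 2 for d in draws) / reps

# The identity holds: mse == var + bias**2 (up to float rounding),
# because it is an algebraic fact about these empirical moments too.
```

The same identity that holds for the population moments also holds exactly for the empirical moments computed here, which makes the check a useful sanity test for simulation code.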

7 Worked Example

Suppose \(X_1,\dots,X_n\) are independent Bernoulli\((p)\) observations, where \(p\) is the unknown success probability.

A natural estimator of \(p\) is the sample proportion \[ \hat{p}=\bar{X}=\frac{1}{n}\sum_{i=1}^n X_i. \]

7.1 Bias of \(\hat{p}\)

Because \(\mathbb{E}[X_i]=p\), \[ \mathbb{E}[\hat{p}] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[X_i] = \frac{1}{n}(np) = p. \]

So \(\hat{p}\) is unbiased.

7.2 Variance of \(\hat{p}\)

Because the \(X_i\) are independent and each has variance \(p(1-p)\), \[ \operatorname{Var}(\hat{p}) = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}(X_i) = \frac{1}{n^2}(np(1-p)) = \frac{p(1-p)}{n}. \]

So repeated samples center correctly around \(p\), and the spread shrinks as \(n\) grows.
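Both facts are easy to verify by simulation. A sketch under assumed values \(p = 0.3\) and \(n = 50\) (chosen only for illustration):

```python
import random

random.seed(1)
p, n, reps = 0.3, 50, 200_000

# Draw many samples and record the sample proportion each time.
phats = []
for _ in range(reps):
    xs = [1 if random.random() < p else 0 for _ in range(n)]
    phats.append(sum(xs) / n)

mean_phat = sum(phats) / reps                            # ≈ p (unbiased)
var_phat = sum((x - mean_phat) ** 2 for x in phats) / reps
# var_phat ≈ p * (1 - p) / n = 0.0042
```

Doubling \(n\) in this simulation roughly halves `var_phat`, matching the \(p(1-p)/n\) formula.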

7.3 A Biased but More Stable Competitor

Now consider \[ \tilde{p} = \frac{\hat{p}+1/2}{2}. \]

Then \[ \mathbb{E}[\tilde{p}] = \frac{\mathbb{E}[\hat{p}] + 1/2}{2} = \frac{p+1/2}{2}, \] which equals \(p\) only when \(p = 1/2\), so \(\tilde{p}\) is biased whenever \(p \neq 1/2\).

But its variance is \[ \operatorname{Var}(\tilde{p}) = \operatorname{Var}\left(\frac{\hat{p}+1/2}{2}\right) = \frac{1}{4}\operatorname{Var}(\hat{p}) = \frac{p(1-p)}{4n}, \] only one quarter of the variance of \(\hat{p}\).

This is the tradeoff in one line:

  • \(\hat{p}\) has zero bias and larger variance
  • \(\tilde{p}\) has nonzero bias and smaller variance

The better choice depends on the full mean squared error and the context, not on one slogan like “always prefer unbiased estimators.”
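Since both estimators have closed-form bias and variance, their mean squared errors can be compared directly. A sketch with illustrative values of \(p\) and \(n\):

```python
def mse_phat(p, n):
    # Unbiased sample proportion: MSE is pure variance.
    return p * (1 - p) / n

def mse_ptilde(p, n):
    # Shrunk estimator: smaller variance plus a squared bias term.
    var = p * (1 - p) / (4 * n)
    bias = (1 - 2 * p) / 4
    return var + bias ** 2

# Near p = 0.5 with small n, the biased estimator wins:
#   mse_ptilde(0.5, 10) = 0.00625 < mse_phat(0.5, 10) = 0.025
# Far from 0.5, the squared bias dominates:
#   mse_ptilde(0.1, 10) = 0.04225 > mse_phat(0.1, 10) = 0.009
```

Neither estimator dominates the other for all \(p\), which is exactly why the slogan "always prefer unbiased" fails.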

8 Computation Lens

When you meet an estimator, a strong first checklist is:

  1. identify the parameter \(\theta\)
  2. write the estimator as an explicit function of the sample
  3. compute or approximate \(\mathbb{E}[\hat{\theta}]\)
  4. compute or approximate \(\operatorname{Var}(\hat{\theta})\)
  5. combine them through MSE if you need an overall error metric
  6. ask what changes as sample size \(n\) grows

This is especially useful in ML and simulation settings, where repeated runs naturally expose both bias and variability.
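The checklist above can be wrapped into one small Monte Carlo helper. The function name and the Gaussian usage example below are illustrative choices, not part of the page:

```python
import random

def estimator_report(estimator, sampler, theta, reps=50_000, seed=0):
    """Monte Carlo version of the checklist: approximate the
    bias, variance, and MSE of an estimator of theta."""
    rng = random.Random(seed)
    values = [estimator(sampler(rng)) for _ in range(reps)]
    mean = sum(values) / reps
    bias = mean - theta
    var = sum((v - mean) ** 2 for v in values) / reps
    return {"bias": bias, "var": var, "mse": var + bias ** 2}

# Hypothetical usage: the sample mean on n = 25 Gaussian draws.
report = estimator_report(
    estimator=lambda xs: sum(xs) / len(xs),
    sampler=lambda rng: [rng.gauss(1.0, 2.0) for _ in range(25)],
    theta=1.0,
)
# Expect bias ≈ 0 and var ≈ 2**2 / 25 = 0.16.
```

Passing `reps` at different sample sizes `n` inside `sampler` answers the checklist's last question: how each term behaves as \(n\) grows.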

9 Application Lens

Bias-variance language appears in several nearby forms:

  • in classical statistics, when comparing point estimators
  • in regression and prediction, where more flexible models can reduce bias but increase variance
  • in experimental evaluation, where a summary across seeds may be unbiased but too noisy to support a strong claim
  • in regularized methods, where intentional bias is often introduced to reduce instability

So this page is not just about introductory formulas. It is the first clean version of a tradeoff that keeps returning later in model selection, learning theory, and empirical science.

10 Stop Here For First Pass

If you can now explain:

  • what an estimator is
  • why an estimator is random before the data are fixed
  • what bias and variance each measure
  • why MSE combines them into one error notion

then this page has done its main job.

11 Go Deeper

The most useful next steps after this page are:

  1. Maximum Likelihood and Bayesian Basics, to see two major estimator design philosophies
  2. Confidence Intervals and Hypothesis Testing, to move from point estimation to uncertainty statements
  3. Law of Large Numbers and CLT if you want the probability-side story behind estimator stabilization

12 Optional Paper Bridge

13 Optional After First Pass

If you want more practice before moving on:

  • compare two estimators of the same parameter and ask which has smaller variance
  • construct a deliberately biased estimator and compute its bias
  • ask whether a model evaluation procedure is noisy, biased, or both

14 Common Mistakes

  • confusing an estimator with its realized estimate
  • treating “unbiased” as automatically meaning “best”
  • looking only at variance and ignoring systematic error
  • forgetting that bias and variance are properties across repeated sampling, not just one dataset
  • using bias-variance language in prediction without first understanding it for point estimation

15 Exercises

  1. Let \(X_1,\dots,X_n\) be i.i.d. with mean \(\mu\). Show that \(\bar{X}\) is an unbiased estimator of \(\mu\).
  2. Suppose an estimator always returns the constant value \(7\) for parameter \(\theta\). What are its variance and bias?
  3. In words, explain why a slightly biased estimator can still have lower mean squared error than an unbiased one.

16 Sources and Further Reading

Sources checked online on 2026-04-24:

  • Penn State STAT 500 Lesson 5
  • Penn State STAT 415 Lesson 1
  • MIT 18.05 Introduction to Statistics
  • MIT 6.041SC Lecture 23