Estimation and Bias-Variance
estimator, bias, variance, mean squared error, sampling
1 Role
This page is the first real inference page in the statistics module.
Its job is to explain how we turn data into a rule for estimating an unknown population quantity, and how we judge whether that rule is good.
2 First-Pass Promise
Read this page after Descriptive Statistics and Data Models.
If you stop here, you should still understand:
- what an estimator is and how it differs from an estimate
- what bias and variance measure
- why low variance alone is not enough
- why mean squared error is the simplest way to balance bias and variance
3 Why It Matters
Statistics is not just about computing a sample mean and moving on.
The real question is: if you used the same estimation procedure on many fresh samples, how would it behave?
That matters immediately in practice:
- a benchmark average can fluctuate a lot across random seeds
- a heavily regularized model can become stable but systematically off-target
- a summary statistic can look precise while still being biased by the data-collection process
- an estimator can be unbiased but so noisy that it is not actually useful
Bias and variance give names to two different failure modes. Once you can separate them, statistical arguments become much clearer.
4 Prerequisite Recall
- a parameter is a population quantity you want to learn about
- a statistic is a quantity computed from a sample
- expectation describes average behavior across repeated sampling
- variance measures how much a random quantity fluctuates
5 Intuition
An estimator is a rule.
It takes in data and outputs a guess for an unknown parameter. If the data change, the output changes too, so the estimator itself is a random quantity before the sample is observed.
That immediately creates two natural questions:
- On average, does the estimator point at the right target?
- How much does the estimator jump around from sample to sample?
The first question is about bias. The second is about variance.
Those are not the same problem. You can make an estimator extremely stable by always reporting the same number, but then its bias may be awful. You can also make an estimator unbiased but very noisy. Good estimation is about balancing both.
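To make the picture of "a rule applied to fresh data" concrete, here is a minimal Python sketch, not taken from any referenced source: it assumes normally distributed data and uses the sample mean as the estimator, with purely illustrative parameter values.

```python
# A minimal sketch: treat the sample mean as a rule, apply it to many fresh
# samples, and watch its output fluctuate from sample to sample.
import numpy as np

rng = np.random.default_rng(0)
true_mu = 3.0          # the unknown parameter (known here only because we simulate)
n = 25                 # sample size per dataset
n_repeats = 10_000     # number of fresh datasets

# Each row is one fresh dataset; each estimate is the rule applied to one dataset.
samples = rng.normal(loc=true_mu, scale=2.0, size=(n_repeats, n))
estimates = samples.mean(axis=1)

print("average estimate:", estimates.mean())    # sits near true_mu (low bias)
print("spread of estimates:", estimates.std())  # sampling variability of the rule
```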
6 Formal Core
Definition 1 (Estimator and Estimate) Let \(\theta\) be an unknown population parameter.
An estimator of \(\theta\) is a statistic \[
\hat{\theta} = T(X_1,\dots,X_n)
\] computed from the sample.
After the data are observed, the realized numerical value of \(\hat{\theta}\) is called the estimate.
Definition 2 (Bias and Variance) The bias of an estimator \(\hat{\theta}\) for parameter \(\theta\) is \[
\operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta.
\]
The estimator is unbiased if \[
\mathbb{E}[\hat{\theta}] = \theta.
\]
Its variance is \[
\operatorname{Var}(\hat{\theta}),
\] which measures how much the estimator changes across repeated samples.
Proposition 1 (Mean Squared Error Decomposition) For scalar parameter estimation, the mean squared error is \[ \operatorname{MSE}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta}-\theta)^2\big]. \]
It decomposes as \[ \operatorname{MSE}(\hat{\theta}) = \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^2. \]
This identity makes the bias-variance tradeoff precise: reducing one term can increase the other.
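One standard way to see the identity is to add and subtract \(\mathbb{E}[\hat{\theta}]\) inside the square: \[ \mathbb{E}\big[(\hat{\theta}-\theta)^2\big] = \mathbb{E}\big[(\hat{\theta}-\mathbb{E}[\hat{\theta}])^2\big] + 2\big(\mathbb{E}[\hat{\theta}]-\theta\big)\,\mathbb{E}\big[\hat{\theta}-\mathbb{E}[\hat{\theta}]\big] + \big(\mathbb{E}[\hat{\theta}]-\theta\big)^2 = \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^2, \] where the middle term vanishes because \(\mathbb{E}\big[\hat{\theta}-\mathbb{E}[\hat{\theta}]\big]=0\).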
7 Worked Example
Suppose \(X_1,\dots,X_n\) are independent Bernoulli\((p)\) observations, where \(p\) is the unknown success probability.
A natural estimator of \(p\) is the sample proportion \[ \hat{p}=\bar{X}=\frac{1}{n}\sum_{i=1}^n X_i. \]
7.1 Bias of \(\hat{p}\)
Because \(\mathbb{E}[X_i]=p\), \[ \mathbb{E}[\hat{p}] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[X_i] = \frac{1}{n}(np) = p. \]
So \(\hat{p}\) is unbiased.
7.2 Variance of \(\hat{p}\)
Because the \(X_i\) are independent and each has variance \(p(1-p)\), \[ \operatorname{Var}(\hat{p}) = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}(X_i) = \frac{1}{n^2}(np(1-p)) = \frac{p(1-p)}{n}. \]
So repeated samples center correctly around \(p\), and the spread shrinks as \(n\) grows.
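A quick Monte Carlo run can confirm both calculations. The sketch below uses the illustrative choices \(p=0.3\) and \(n=50\); the printed values should land close to the formulas above.

```python
# Monte Carlo check of the Bernoulli example: the sample proportion should
# center on p and have variance close to p*(1-p)/n.
import numpy as np

rng = np.random.default_rng(1)
p, n, n_repeats = 0.3, 50, 200_000

draws = rng.binomial(1, p, size=(n_repeats, n))  # each row is one fresh sample
p_hat = draws.mean(axis=1)                        # the estimator applied row by row

print("mean of p_hat:       ", p_hat.mean())      # ~ 0.3, so (approximately) unbiased
print("variance of p_hat:   ", p_hat.var())       # ~ p*(1-p)/n
print("theoretical variance:", p * (1 - p) / n)   # 0.0042 for these choices
```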
7.3 A Biased but More Stable Competitor
Now consider \[ \tilde{p} = \frac{\hat{p}+1/2}{2}. \]
Then \[ \mathbb{E}[\tilde{p}] = \frac{\mathbb{E}[\hat{p}] + 1/2}{2} = \frac{p+1/2}{2}, \] so \(\tilde{p}\) is generally biased.
But its variance is \[ \operatorname{Var}(\tilde{p}) = \operatorname{Var}\left(\frac{\hat{p}+1/2}{2}\right) = \frac{1}{4}\operatorname{Var}(\hat{p}) = \frac{p(1-p)}{4n}, \] which is smaller than the variance of \(\hat{p}\).
This is the tradeoff in one line:
- \(\hat{p}\) has zero bias and larger variance
- \(\tilde{p}\) has nonzero bias and smaller variance
The better choice depends on the full mean squared error and the context, not on one slogan like “always prefer unbiased estimators.”
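To see when the biased competitor actually wins, one can compare the two exact mean squared errors as functions of \(p\), using the formulas derived above. The sample size \(n = 20\) in the sketch below is only illustrative; the crossover region depends on \(n\).

```python
# Compare exact MSEs of the unbiased estimator p_hat and the shrunk estimator
# p_tilde = (p_hat + 1/2) / 2, using the closed-form expressions from the example.
import numpy as np

n = 20
p = np.linspace(0.0, 1.0, 101)

mse_hat = p * (1 - p) / n                                    # variance only, zero bias
mse_tilde = p * (1 - p) / (4 * n) + ((1 - 2 * p) / 4) ** 2   # variance + squared bias

# The biased estimator has smaller MSE roughly when p is near 1/2;
# near the endpoints its systematic error dominates.
better = p[mse_tilde < mse_hat]
print("p_tilde wins for p in about [%.2f, %.2f] when n = %d"
      % (better.min(), better.max(), n))
```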
8 Computation Lens
When you meet an estimator, a strong first checklist is:
- identify the parameter \(\theta\)
- write the estimator as an explicit function of the sample
- compute or approximate \(\mathbb{E}[\hat{\theta}]\)
- compute or approximate \(\operatorname{Var}(\hat{\theta})\)
- combine them through MSE if you need an overall error metric
- ask what changes as sample size \(n\) grows
This is especially useful in ML and simulation settings, where repeated runs naturally expose both bias and variability.
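The checklist translates directly into a small simulation harness. The helper below is a hypothetical sketch, not part of any library (the name estimate_bias_variance is made up for illustration), and it assumes you can draw fresh datasets from the model.

```python
# A generic sketch of the checklist: approximate bias, variance, and MSE of any
# estimator by applying it to many simulated datasets.
import numpy as np

def estimate_bias_variance(sample_data, estimator, theta_true, n_repeats=10_000):
    """sample_data() returns one fresh dataset; estimator(data) returns a scalar."""
    estimates = np.array([estimator(sample_data()) for _ in range(n_repeats)])
    bias = estimates.mean() - theta_true
    variance = estimates.var()
    return {"bias": bias, "variance": variance, "mse": bias ** 2 + variance}

# Example: the Bernoulli sample proportion with p = 0.3 and n = 50.
rng = np.random.default_rng(2)
report = estimate_bias_variance(
    sample_data=lambda: rng.binomial(1, 0.3, size=50),
    estimator=np.mean,
    theta_true=0.3,
)
print(report)
```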
9 Application Lens
Bias-variance language appears in several nearby forms:
- in classical statistics, when comparing point estimators
- in regression and prediction, where more flexible models can reduce bias but increase variance
- in experimental evaluation, where a summary across seeds may be unbiased but too noisy to support a strong claim
- in regularized methods, where intentional bias is often introduced to reduce instability
So this page is not just about introductory formulas. It is the first clean version of a tradeoff that keeps returning later in model selection, learning theory, and empirical science.
10 Stop Here For First Pass
If you can now explain:
- what an estimator is
- why an estimator is random before the data are fixed
- what bias and variance each measure
- why MSE combines them into one error notion
then this page has done its main job.
11 Go Deeper
The most useful next steps after this page are:
- Maximum Likelihood and Bayesian Basics, to see two major estimator design philosophies
- Confidence Intervals and Hypothesis Testing, to move from point estimation to uncertainty statements
- Law of Large Numbers and CLT, if you want the probability-side story behind estimator stabilization
12 Optional Paper Bridge
- Penn State STAT 500 Lesson 5: Confidence Intervals -
First pass- strong official open lesson on how estimation turns into practical inference for means and proportions. Checked2026-04-24. - Penn State STAT 415 Lesson 1: Point Estimation -
Second pass- official math-stat treatment of estimators, bias, and unbiasedness. Checked2026-04-24. - MIT 6.041SC Lecture 23 -
Paper bridge- official MIT notes that show bias-variance and MSE as design criteria for estimators. Checked2026-04-24.
13 Optional After First Pass
If you want more practice before moving on:
- compare two estimators of the same parameter and ask which has smaller variance
- construct a deliberately biased estimator and compute its bias
- ask whether a model evaluation procedure is noisy, biased, or both
14 Common Mistakes
- confusing an estimator with its realized estimate
- treating “unbiased” as automatically meaning “best”
- looking only at variance and ignoring systematic error
- forgetting that bias and variance are properties across repeated sampling, not just one dataset
- using bias-variance language in prediction without first understanding it for point estimation
15 Exercises
- Let \(X_1,\dots,X_n\) be i.i.d. with mean \(\mu\). Show that \(\bar{X}\) is an unbiased estimator of \(\mu\).
- Suppose an estimator always returns the constant value \(7\) for parameter \(\theta\). What are its variance and bias?
- In words, explain why a slightly biased estimator can still have lower mean squared error than an unbiased one.
16 Sources and Further Reading
- Penn State STAT 500 Lesson 5: Confidence Intervals - First pass - official applied-statistics bridge from sample summaries to estimation and inference. Checked 2026-04-24.
- Penn State STAT 415 Lesson 1: Point Estimation - First pass - official source with a clear mathematical treatment of point estimators and unbiasedness. Checked 2026-04-24.
- MIT 18.05 Introduction to Statistics - Second pass - official MIT reading on the move from probability models to parameter inference. Checked 2026-04-24.
- MIT 6.041SC Lecture 23 - Paper bridge - official MIT notes emphasizing estimator quality through bias, variance, and MSE. Checked 2026-04-24.
Sources checked online on 2026-04-24:
- Penn State STAT 500 Lesson 5
- Penn State STAT 415 Lesson 1
- MIT 18.05 Introduction to Statistics
- MIT 6.041SC Lecture 23