Confidence Intervals and Hypothesis Testing
confidence interval, hypothesis testing, p-value, significance level, margin of error
1 Role
This page is the bridge from point estimation to formal statistical decisions.
Its job is to explain two closely related tools: confidence intervals, which summarize uncertainty around an estimate, and hypothesis tests, which decide whether the observed data are sufficiently incompatible with a null claim.
2 First-Pass Promise
Read this page after Maximum Likelihood and Bayesian Basics.
If you stop here, you should still understand:
- what a confidence interval is and how to interpret it correctly
- what a null hypothesis, alternative hypothesis, significance level, and p-value are
- why failing to reject is not the same as proving the null
- how a two-sided hypothesis test lines up with a corresponding confidence interval
3 Why It Matters
A huge amount of published quantitative work is really built from these two ideas.
They appear whenever someone reports:
- an error bar
- a margin of error
- a p-value
- “statistically significant”
- “not significantly different”
- a confidence band or uncertainty interval
If you do not understand what these mean, it becomes very easy to overread tables and plots:
- a narrow interval can be mistaken for certainty
- a small p-value can be mistaken for a large or important effect
- a non-significant result can be mistaken for evidence of no effect
- a confidence interval can be misread as a posterior probability statement
This page is meant to make those errors much harder.
4 Prerequisite Recall
- an estimator is a random quantity before the data are observed
- bias and variance describe repeated-sampling behavior of estimators
- a likelihood or model tells us how data would behave under parameter values or hypotheses
5 Intuition
Confidence intervals and hypothesis tests answer related but different questions.
A confidence interval asks:
which parameter values remain reasonably compatible with the data?
A hypothesis test asks:
if a specific null claim were true, would these data look too surprising?
So the interval is a range-style summary, while the test is a decision-style procedure.
They are often taught separately, but they live in the same repeated-sampling world. In common two-sided settings, the connection is especially clean:
- if the hypothesized value lies outside the \((1-\alpha)\) confidence interval, reject the corresponding two-sided null at level \(\alpha\)
- if it lies inside, fail to reject
That relationship helps keep the tools conceptually unified instead of feeling like two unrelated rituals.
6 Formal Core
Definition 1 (Confidence Interval) A \((1-\alpha)\) confidence interval for parameter \(\theta\) is a random interval \[ [L(X), U(X)] \] constructed from the sample such that, under the repeated-sampling interpretation, \[ \mathbb{P}\big(\theta \in [L(X), U(X)]\big) \approx 1-\alpha \] for the interval-generating procedure, or exactly \(1-\alpha\) in special exact constructions.
After the data are observed, the interval becomes a fixed numerical range.
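The repeated-sampling reading of Definition 1 can be checked by simulation. The sketch below is a minimal illustration, not part of the formal development: it assumes a true proportion \(p=0.3\), \(n=100\), and the large-sample (Wald) interval used later on this page, then counts how often freshly built intervals cover the truth.

```python
import math
import random

def wald_interval(x, n, z=1.96):
    """Large-sample (Wald) 95% confidence interval for a proportion."""
    phat = x / n
    se = math.sqrt(phat * (1 - phat) / n)
    return phat - z * se, phat + z * se

# Repeated-sampling check: how often does the interval cover the true p?
random.seed(0)
p_true, n, trials = 0.3, 100, 10_000
covered = 0
for _ in range(trials):
    x = sum(random.random() < p_true for _ in range(n))
    lo, hi = wald_interval(x, n)
    covered += lo <= p_true <= hi
print(covered / trials)  # close to 0.95, not exactly: this interval is approximate
```

The coverage fraction lands near, but not exactly at, 0.95, which is the "\(\approx 1-\alpha\) for the interval-generating procedure" clause of the definition in action.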
Definition 2 (Hypothesis Test) A hypothesis test begins with:
- a null hypothesis \(H_0\)
- an alternative hypothesis \(H_A\)
- a significance level \(\alpha\)
The test uses the sample to compute a test statistic and then a p-value or rejection rule.
The p-value is the probability, assuming \(H_0\) is true, of obtaining data at least as extreme as what was observed in the direction of \(H_A\).
Proposition 1 (Confidence Intervals and Two-Sided Tests) In many standard one-parameter settings, a two-sided level-\(\alpha\) test of \[ H_0:\theta=\theta_0 \qquad \text{vs.} \qquad H_A:\theta\neq\theta_0 \] rejects exactly when \(\theta_0\) lies outside the corresponding \((1-\alpha)\) confidence interval.
This relation does not mean intervals and tests are identical, but it does mean they often summarize the same information in different forms.
7 Worked Example
Suppose a product team wants to estimate the fraction \(p\) of users who click a new recommendation module.
They observe \(n=100\) users and see \(x=62\) clicks, so \[ \hat{p}=\frac{62}{100}=0.62. \]
7.1 Confidence Interval
Using the usual large-sample standard error, \[ \operatorname{SE}(\hat{p}) \approx \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.62\cdot 0.38}{100}} \approx 0.0485. \]
A rough 95% confidence interval is \[ \hat{p} \pm 1.96\cdot \operatorname{SE}(\hat{p}) \] so \[ 0.62 \pm 1.96(0.0485) \approx 0.62 \pm 0.095. \]
That gives the interval \[ (0.525,\;0.715). \]
This means the interval-building procedure has 95% repeated-sampling coverage under its assumptions. It does not mean there is a 95% posterior probability that \(p\) lies inside this already computed interval.
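The arithmetic above is short enough to sketch directly; this snippet just reproduces the hand computation with the same numbers (\(n=100\), \(x=62\), and the 1.96 critical value for 95%).

```python
import math

n, x = 100, 62
phat = x / n                                  # point estimate 0.62
se = math.sqrt(phat * (1 - phat) / n)         # large-sample standard error
margin = 1.96 * se                            # 95% margin of error
print(f"({phat - margin:.3f}, {phat + margin:.3f})")  # → (0.525, 0.715)
```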
7.2 Hypothesis Test
Now test \[ H_0:p=0.5 \qquad \text{vs.} \qquad H_A:p\neq 0.5. \]
Under \(H_0\), the standard error is \[ \sqrt{\frac{0.5(1-0.5)}{100}} = 0.05. \]
The z-statistic is \[ z = \frac{0.62-0.5}{0.05}=2.4. \]
A two-sided p-value for \(z=2.4\) is about \[ 0.016. \]
So at significance level \(\alpha=0.05\), we reject \(H_0\).
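The test computation can be sketched the same way. Note one detail worth seeing in code: the standard error here is evaluated at the null value \(p_0=0.5\), not at \(\hat{p}\). The two-sided normal tail probability is obtained via `math.erfc`, using the identity \(2(1-\Phi(|z|)) = \operatorname{erfc}(|z|/\sqrt{2})\).

```python
import math

n, x, p0 = 100, 62, 0.5
phat = x / n
se0 = math.sqrt(p0 * (1 - p0) / n)          # standard error computed under H0
z = (phat - p0) / se0                       # z-statistic
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
print(round(z, 2), round(p_value, 3))       # → 2.4 0.016
```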
7.3 Relationship
Notice that the hypothesized value \(0.5\) does not lie in the 95% confidence interval \[ (0.525,\;0.715). \]
That matches the two-sided test decision at level \(0.05\).
This is the main structural point:
- interval view: values near \(0.62\) remain plausible
- test view: the specific null value \(0.5\) is too far away to remain compatible at level \(0.05\)
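The agreement between the two views can be checked numerically. One caveat worth hedging: the Wald interval uses a standard error evaluated at \(\hat{p}\) while this z-test uses one evaluated at \(p_0\), so the match is approximate rather than exact; procedures built from the same standard error agree exactly. For this example both decisions line up.

```python
import math

def wald_ci(phat, n, z=1.96):
    """95% Wald interval, standard error evaluated at phat."""
    se = math.sqrt(phat * (1 - phat) / n)
    return phat - z * se, phat + z * se

def z_test_rejects(phat, n, p0, z_crit=1.96):
    """Two-sided level-0.05 z-test, standard error evaluated at p0."""
    se0 = math.sqrt(p0 * (1 - p0) / n)
    return abs(phat - p0) / se0 > z_crit

lo, hi = wald_ci(0.62, 100)
print(z_test_rejects(0.62, 100, 0.5), not (lo <= 0.5 <= hi))  # → True True
```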
8 Computation Lens
A good workflow for standard one-parameter inference is:
- identify the parameter of interest
- write the estimator and its standard error
- choose a confidence level or significance level
- check assumptions or conditions
- compute either:
- an interval, if the question is estimation-focused
- a p-value or rejection decision, if the question is claim-focused
- translate the numerical result back into the original scientific or engineering question
This last step matters a lot. A correct z-score with a bad interpretation is still a bad conclusion.
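The steps above can be sketched as one function for the one-proportion case. All names here are illustrative, the 1.96 critical value hard-codes a 95% / level-0.05 analysis, and the condition check is a rough rule of thumb rather than a full diagnostic; the final translation step stays with the analyst.

```python
import math

def infer_proportion(x, n, p0=None):
    """Workflow sketch for one-proportion inference (illustrative only).

    Returns a 95% Wald interval, plus a two-sided z-test p-value
    if a null value p0 is supplied.
    """
    # steps 1-2: parameter of interest, its estimator, and standard error
    phat = x / n
    se = math.sqrt(phat * (1 - phat) / n)
    # step 4: rough large-sample condition check for the normal approximation
    assert n * phat >= 10 and n * (1 - phat) >= 10, "large-sample check failed"
    # step 5a: estimation-focused answer, the interval
    interval = (phat - 1.96 * se, phat + 1.96 * se)
    # step 5b: claim-focused answer, the p-value, if a null is given
    p_value = None
    if p0 is not None:
        se0 = math.sqrt(p0 * (1 - p0) / n)  # standard error under H0
        z = (phat - p0) / se0
        p_value = math.erfc(abs(z) / math.sqrt(2))
    # step 6 (translating back to the original question) is not automatable
    return interval, p_value

interval, p_value = infer_proportion(62, 100, p0=0.5)
```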
9 Application Lens
In research practice, confidence intervals and tests help with:
- reporting uncertainty around benchmark differences
- judging whether an observed effect could be explained by sampling noise
- deciding whether a claimed improvement is both statistically and practically meaningful
- turning repeated-seed or repeated-run variation into a visible uncertainty summary
This is also where many paper-reading mistakes happen. A tiny p-value is not the same thing as a big effect, and a wide interval is often more informative than a bare “significant / not significant” label.
10 Stop Here For First Pass
If you can now explain:
- how to interpret a confidence interval correctly
- what a p-value means
- why “fail to reject” is weaker than “accept”
- why a two-sided test and a matching confidence interval often agree
then this page has done its main job.
11 Go Deeper
The most useful next steps after this page are:
- Regression and Classification Basics, where intervals and tests attach to fitted models and parameters
- Estimation and Bias-Variance if you want to revisit repeated-sampling behavior behind interval width
- Maximum Likelihood and Bayesian Basics if you want to contrast frequentist intervals/tests with posterior summaries
12 Optional Paper Bridge
- Penn State STAT 500 Lesson 5: Confidence Intervals - First pass - official open lesson covering the structure and interpretation of confidence intervals. Checked 2026-04-24.
- Penn State STAT 500 Lesson 6: Hypothesis Testing - First pass - official open lesson on test setup, p-values, decisions, and the CI/test relationship. Checked 2026-04-24.
- Penn State STAT 200 Section 6.6: Confidence Intervals & Hypothesis Testing - Second pass - concise official reinforcement of when to use intervals versus tests. Checked 2026-04-24.
- MIT 18.05 Introduction to Statistics - Second pass - official MIT notes with examples of confidence intervals, tests, and common interpretation pitfalls. Checked 2026-04-24.
13 Optional After First Pass
If you want more practice before moving on:
- take one reported interval from a paper and write out its correct repeated-sampling interpretation
- compare a confidence interval with a p-value for the same parameter question
- ask whether a statistically significant result is also practically important in context
14 Common Mistakes
- saying the parameter has a 95% chance of lying in the computed confidence interval
- reading the p-value as the probability that the null hypothesis is true
- treating non-significance as proof of no effect
- confusing statistical significance with practical importance
- forgetting that CI/test equivalence is mainly for matching two-sided settings under the same assumptions
15 Exercises
- A 95% confidence interval for a population proportion is \((0.41, 0.53)\). What does this tell you about testing \(H_0:p=0.5\) versus \(H_A:p\neq0.5\) at level \(0.05\)?
- In one sentence, define a p-value without saying “probability the null is true.”
- Explain why a very large sample can make a tiny effect statistically significant.
16 Sources and Further Reading
- Penn State STAT 500 Lesson 5: Confidence Intervals - First pass - official applied-statistics lesson on interval construction and interpretation. Checked 2026-04-24.
- Penn State STAT 500 Lesson 6: Hypothesis Testing - First pass - official lesson on test setup, p-values, and decision rules. Checked 2026-04-24.
- Penn State STAT 200 Section 6.6: Confidence Intervals & Hypothesis Testing - Second pass - compact official bridge between interval and testing viewpoints. Checked 2026-04-24.
- MIT 18.05 Introduction to Statistics - Second pass - official MIT notes with good examples and cautionary interpretation points. Checked 2026-04-24.